Nutch 0.9 already extracts the properties in MSExtractor.java and MSBaseParser puts them into the MetaData class.

I'm not using Nutch in its entirety, only the parsing framework, but I am indexing the document properties quite happily from MS documents. I also wrote a new parser for Office 2007 using POI 3.5, and it gets the properties in a similar way. Is the problem at a higher level, in that Nutch is not indexing the MetaData?

Antony




Doğacan Güney wrote:
On Fri, Jan 30, 2009 at 9:15 PM, ahammad <[email protected]> wrote:
Hello,

I've been looking further into this and it seems like the only way to do it
is to modify the msword parser so that it reads in the custom properties
information. I'm attempting this, but so far I haven't been successful.

The classes that I found that may be useful are
org.apache.poi.hpsf.DocumentSummaryInformation and
org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
that I need.

I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
my time?

Does anybody have any other suggestions? This seems like it'll be a lot of work
with a very small chance of success. Any alternative methods would be nice.


No, you are doing the right thing. Alternatively, if you know of a good Java
library for extracting the information you are looking for, you can write your
own parse-ms plugin as well.

Extract any metadata you want and put it in the parse data metadata. You can then
read it during indexing and add it to your index.
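
The two-step hand-off described here (store the extracted properties at parse time, read them back at index time) can be sketched in plain Java. This is only an illustration of the flow, not Nutch's API: the class MetadataFlowSketch, both method names, the "meta_" field prefix and the sample property names are all hypothetical, and ordinary maps stand in for the parse data metadata and the index document.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the parse -> index hand-off; not Nutch's API.
class MetadataFlowSketch {

    // Parse stage: copy every non-empty extracted property into the
    // metadata map that travels with the parsed document.
    static Map<String, String> toParseMetadata(Map<String, String> extracted) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String value = e.getValue();
            if (value != null && !value.isEmpty()) {
                parseMeta.put(e.getKey(), value);
            }
        }
        return parseMeta;
    }

    // Index stage: look up only the wanted properties and turn each one
    // that is present into an index field (prefix is an assumption).
    static Map<String, String> toIndexFields(Map<String, String> parseMeta,
                                             String... wanted) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : wanted) {
            String value = parseMeta.get(name);
            if (value != null) {
                fields.put("meta_" + name.toLowerCase(Locale.ROOT), value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up properties, as a parser might report them.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("Author", "Jane Doe");
        extracted.put("Department", "Engineering");
        extracted.put("Comments", "");

        Map<String, String> parseMeta = toParseMetadata(extracted);
        System.out.println(toIndexFields(parseMeta, "Department", "Comments"));
    }
}
```

In a real plugin the first step would presumably live in the parser and the second in an indexing filter; the point of the sketch is only that a property which never reaches the parse metadata cannot show up in the index.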

Thanks a lot.

Cheers

