Hello,

I've been looking further into this, and it seems the only way to do it is
to modify the msword parser so that it reads the custom properties
information. I'm attempting this, but so far I haven't been successful.

The classes I found that may be useful are
org.apache.poi.hpsf.DocumentSummaryInformation and
org.apache.poi.hpsf.CustomProperties. I'm not sure whether I need anything
else.

I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
my time?
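
To make the goal concrete, here is a minimal sketch of the step that would
come after extraction. It assumes the custom properties have already been
read (for example through DocumentSummaryInformation.getCustomProperties()
in POI's HPSF) and copied into a plain map of names to values. The class
name CustomPropertyFlattener, the flatten() method and the "custom." prefix
are hypothetical and are not part of Nutch or POI; the sketch only shows how
the values could be normalized into string fields that a parse or indexing
filter might store.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

public class CustomPropertyFlattener {

    // Hypothetical prefix that keeps custom properties apart from built-in fields.
    static final String PREFIX = "custom.";

    // Turns name/value pairs into lower-cased, prefixed string fields.
    // Entries with a null or blank name, or a null value, are skipped.
    static Map<String, String> flatten(Map<String, Object> customProps) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : customProps.entrySet()) {
            String name = e.getKey();
            Object value = e.getValue();
            if (name == null || name.trim().isEmpty() || value == null) {
                continue;
            }
            String key = PREFIX
                    + name.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "_");
            out.put(key, toText(value));
        }
        return out;
    }

    // Dates are written as UTC timestamps; everything else uses String.valueOf.
    static String toText(Object value) {
        if (value instanceof Date) {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f.format((Date) value);
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        // Sample values standing in for properties set on Word's "Custom" tab.
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("Department", "Engineering");
        props.put("Review Count", 3);
        props.put("Approved", Boolean.TRUE);
        props.put("Checked", new Date(0L));
        System.out.println(flatten(props));
    }
}
```

Keeping this step independent of POI means it can be tried without a Word
file at hand; the part that actually reads the property stream inside
MSExtractor.java is still the open question.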

Does anybody have any other suggestions? This looks like a lot of work with
a very small chance of success. Any alternative methods would be welcome.

Thanks a lot.

Cheers



ahammad wrote:
> 
> I have successfully gotten Nutch to index msword documents. If you go
> under File>Properties, and under the "Custom" tab in MS Word, you can add
> some properties to the file, sort of like HTML meta tags.
> 
> I have the msword parser, index-more and query-more plugins, as well as a
> custom meta tag indexer/filter installed. My question is can Nutch read
> document properties like the ones I described? Does it have the ability to
> go that far in the document to extract the custom user-defined properties?
> 
> If so, was there anybody that successfully implemented this? If not, I
> would imagine that we need to modify index-more/query-more plugins to do
> that. Can someone confirm this?
> 
> Anyone know of a good place to start looking? Any help will be
> appreciated.
> 
> Cheers.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Indexing-msword-document-properties-tp21715700p21753762.html
Sent from the Nutch - User mailing list archive at Nabble.com.
