Hello, I've been looking further into this, and it seems the only way to do it is to modify the msword parser so that it reads the custom properties information. I've been attempting this, but so far I haven't been successful.
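
For reference, here is a rough sketch of the kind of extraction I have in mind. It is not working Nutch code: it assumes a recent Apache POI release (4.x or 5.x, where CustomProperties has nameSet() and POIFSFileSystem is closeable), and the class name CustomPropertyReader and its read() method are made up for illustration. It only shows reading the custom properties out of the document summary stream with POI's HPSF classes. Wiring the result into the parser is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.poi.hpsf.CustomProperties;
import org.apache.poi.hpsf.DocumentSummaryInformation;
import org.apache.poi.hpsf.HPSFException;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical helper; not part of Nutch or the lib-parsems plugin.
final class CustomPropertyReader {

    // Returns the user-defined properties (File > Properties > Custom) of an
    // OLE2 document such as a .doc file, or an empty map if there are none.
    static Map<String, Object> read(InputStream doc) throws IOException, HPSFException {
        Map<String, Object> result = new LinkedHashMap<>();
        String streamName = DocumentSummaryInformation.DEFAULT_STREAM_NAME;

        try (POIFSFileSystem fs = new POIFSFileSystem(doc)) {
            // Not every document carries a document summary stream.
            if (!fs.getRoot().hasEntry(streamName)) {
                return result;
            }
            try (InputStream stream = fs.createDocumentInputStream(streamName)) {
                PropertySet ps = PropertySetFactory.create(stream);
                if (ps instanceof DocumentSummaryInformation) {
                    CustomProperties custom =
                            ((DocumentSummaryInformation) ps).getCustomProperties();
                    // getCustomProperties() may return null when no custom section exists.
                    if (custom != null) {
                        for (String name : custom.nameSet()) {
                            result.put(name, custom.get(name));
                        }
                    }
                }
            }
        }
        return result;
    }
}
```

If something like this works, the returned name/value pairs could presumably be copied into the parse metadata from MSExtractor.java, so that an indexing filter can pick them up the same way a meta tag indexer does. I have not verified that part.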
The classes that I found that may be useful are org.apache.poi.hpsf.DocumentSummaryInformation and org.apache.poi.hpsf.CustomProperties. I'm not sure whether I need anything else. I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the lib-parsems plugin.

Am I proceeding correctly with this, or am I just wasting my time? Does anybody have any other suggestions? This looks like a lot of work with a very small chance of success, so any alternative methods would be welcome.

Thanks a lot.

Cheers

ahammad wrote:
>
> I have successfully gotten Nutch to index msword documents. If you go
> under File > Properties, and under the "Custom" tab in MS Word, you can add
> some properties to the file, sort of like HTML meta tags.
>
> I have the msword parser, index-more and query-more plugins, as well as a
> custom meta tag indexer/filter installed. My question is: can Nutch read
> document properties like the ones I described? Does it have the ability to
> go that far into the document to extract the custom user-defined properties?
>
> If so, has anybody successfully implemented this? If not, I
> would imagine that we need to modify the index-more/query-more plugins to do
> that. Can someone confirm this?
>
> Does anyone know of a good place to start looking? Any help will be
> appreciated.
>
> Cheers.
>

--
View this message in context: http://www.nabble.com/Indexing-msword-document-properties-tp21715700p21753762.html
Sent from the Nutch - User mailing list archive at Nabble.com.
