On Fri, Jan 30, 2009 at 9:15 PM, ahammad <[email protected]> wrote:
>
> Hello,
>
> I've been looking further into this and it seems like the only way to do it
> is to modify the msword parser so that it reads in the custom properties
> information. I'm attempting this but so far, I wasn't successful.
>
> The classes that I found that may be useful are
> org.apache.poi.hpsf.DocumentSummaryInformation and
> org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
> that I need.
>
> I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
> lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
> my time?
>
> Anybody has any other suggestions? This seems like it'll be a lot of work
> with a very small chance of success. Any alternative methods would be nice.
>
No, you are doing the right thing. Alternatively, if you know of a good Java
library for extracting the information you are looking for, you can write
your own parse-ms plugin as well. Extract any metadata you want and put it
in the parse data metadata. You can then read it during indexing and add it
to your index.

> Thanks a lot.
>
> Cheers
>
>
>
> ahammad wrote:
>>
>> I have successfully gotten Nutch to index msword documents. If you go
>> under File>Properties, and under the "Custom" tab in MS Word, you can add
>> some properties to the file, sort of like HTML meta tags.
>>
>> I have the msword parser, index-more and query-more plugins, as well as a
>> custom meta tag indexer/filter installed. My question is can Nutch read
>> document properties like the ones I described? Does it have the ability to
>> go that far in the document to extract the custom user-defined properties?
>>
>> If so, was there anybody that successfully implemented this? If not, I
>> would imagine that we need to modify index-more/query-more plugins to do
>> that. Can someone confirm this?
>>
>> Anyone know of a good place to start looking? Any help will be
>> appreciated.
>>
>> Cheers.
>>
>>
>
> --
> View this message in context:
> http://www.nabble.com/Indexing-msword-document-properties-tp21715700p21753762.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

-- 
Doğacan Güney
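
The hand-off described in the reply above (extract the custom properties at
parse time, store them in the parse metadata, read them back at indexing
time) could be sketched roughly as follows. This is only an illustration
under stated assumptions: plain java.util maps stand in for the POI custom
properties and for the Nutch parse metadata, and the class name, method
names, and the "msword.custom." prefix are all hypothetical, not part of
Nutch or POI.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Sketch only: models the parse-time / index-time hand-off with plain maps.
 * None of these names come from Nutch or POI.
 */
final class CustomPropertyMetadata {

    /** Hypothetical prefix that keeps custom properties apart from other metadata keys. */
    static final String PREFIX = "msword.custom.";

    private CustomPropertyMetadata() {}

    /** Parse step: flatten property name/value pairs into string-valued metadata. */
    static Map<String, String> toParseMetadata(Map<String, ?> customProperties) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, ?> e : customProperties.entrySet()) {
            if (e.getKey() == null || e.getValue() == null) {
                continue; // skip unnamed or empty properties
            }
            String key = PREFIX + e.getKey().trim().toLowerCase(Locale.ROOT);
            metadata.put(key, String.valueOf(e.getValue()));
        }
        return metadata;
    }

    /** Indexing step: pick the prefixed entries back out as field name/value pairs. */
    static Map<String, String> toIndexFields(Map<String, String> parseMetadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : parseMetadata.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                fields.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return fields;
    }
}
```

In a real plugin, the first method would be fed from whatever the parser
reads out of the document, and the second would run inside an indexing
filter that adds each returned pair as a field on the index document. The
prefix is one way to keep user-defined properties from colliding with
metadata keys that other plugins already set.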
