On Fri, Jan 30, 2009 at 9:15 PM, ahammad <[email protected]> wrote:
>
> Hello,
>
> I've been looking further into this and it seems like the only way to do it
> is to modify the msword parser so that it reads in the custom properties
> information. I'm attempting this, but so far I haven't been successful.
>
> The classes that I found that may be useful are
> org.apache.poi.hpsf.DocumentSummaryInformation and
> org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
> that I need.
>
> I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
> lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
> my time?
>
> Does anybody have any other suggestions? This seems like it'll be a lot of work
> with a very small chance of success. Any alternative methods would be nice.
>

No, you are doing the right thing. Alternatively, if you know of a good
Java library for extracting the information you are looking for, you can
write your own parse-ms plugin as well.

Extract any metadata you want and put it in the parse data's metadata. You can
then read it back during indexing and add it to your index.
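
A rough sketch of that two-step flow in plain Java, in case it helps. It
assumes the custom properties have already been read out of the document
(for example with the POI classes mentioned above) into an ordinary Map.
The plain maps stand in for the parse metadata and the index document, and
the class name, method names and "msword.custom." key prefix are made up
for illustration; none of them are Nutch or POI API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Illustration only: plain maps stand in for the parse metadata and the
 * index document. Names and the key prefix are hypothetical.
 */
class CustomPropertyMetadata {

    // Hypothetical prefix so custom properties cannot clash with other metadata keys.
    static final String PREFIX = "msword.custom.";

    /** Parse side: copy each custom property into the metadata under a prefixed key. */
    static Map<String, String> toParseMetadata(Map<String, ?> customProperties) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : customProperties.entrySet()) {
            // Skip entries with no name or no value.
            if (entry.getKey() == null || entry.getValue() == null) {
                continue;
            }
            metadata.put(PREFIX + entry.getKey(), String.valueOf(entry.getValue()));
        }
        return metadata;
    }

    /** Indexing side: pick the prefixed keys back out and turn them into field names. */
    static Map<String, String> toIndexFields(Map<String, String> parseMetadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : parseMetadata.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(PREFIX)) {
                // Strip the prefix and lowercase the rest to get the field name.
                String field = key.substring(PREFIX.length()).toLowerCase(Locale.ROOT);
                fields.put(field, entry.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> customProperties = new LinkedHashMap<>();
        customProperties.put("Department", "Engineering");
        customProperties.put("Reviewed", Boolean.TRUE);

        Map<String, String> metadata = toParseMetadata(customProperties);
        System.out.println(toIndexFields(metadata));
    }
}
```

In a real plugin the first method would live in the parser and the second in
an indexing filter; the prefix just makes it easy to tell the custom
properties apart from everything else stored in the metadata.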

> Thanks a lot.
>
> Cheers
>
>
>
> ahammad wrote:
>>
>> I have successfully gotten Nutch to index msword documents. If you go
>> under File>Properties, and under the "Custom" tab in MS Word, you can add
>> some properties to the file, sort of like HTML meta tags.
>>
>> I have the msword parser, index-more and query-more plugins, as well as a
>> custom meta tag indexer/filter installed. My question is can Nutch read
>> document properties like the ones I described? Does it have the ability to
>> go that far in the document to extract the custom user-defined properties?
>>
>> If so, has anybody successfully implemented this? If not, I
>> would imagine that we need to modify index-more/query-more plugins to do
>> that. Can someone confirm this?
>>
>> Anyone know of a good place to start looking? Any help will be
>> appreciated.
>>
>> Cheers.
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Indexing-msword-document-properties-tp21715700p21753762.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>



-- 
Doğacan Güney
