Re: Indexing msword document properties

ahammad Wed, 04 Feb 2009 06:57:18 -0800

Seems like my previous message never went through.

The Nutch msword parser does index _some_ metadata. The fields shown in
Microsoft Word under File>Properties on the Summary tab (author, company
etc.) are indexed. However, you can also add custom properties to any Word
document (File>Properties, Custom tab), and that metadata is not indexed.
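
A rough, dependency-free sketch of the missing step, assuming the parser
collects properties into a plain name/value map before they reach the
index. The class and method names here are hypothetical stand-ins, not the
actual MSExtractor or POI API; in POI itself the custom properties would
presumably be read through the org.apache.poi.hpsf classes instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the metadata a parser hands to the indexer.
final class CustomPropertyMerge {

    // Copies the summary properties, then adds custom properties that do
    // not clash with a summary name and that actually have a value.
    static Map<String, String> mergeProperties(Map<String, String> summary,
                                               Map<String, Object> custom) {
        Map<String, String> metadata = new LinkedHashMap<>(summary);
        for (Map.Entry<String, Object> entry : custom.entrySet()) {
            if (entry.getValue() != null) {
                metadata.putIfAbsent(entry.getKey(), String.valueOf(entry.getValue()));
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> summary = new LinkedHashMap<>();
        summary.put("author", "ahammad");

        Map<String, Object> custom = new LinkedHashMap<>();
        custom.put("productType", "printer");

        // prints {author=ahammad, productType=printer}
        System.out.println(mergeProperties(summary, custom));
    }
}
```

Keeping the summary value on a name clash is only one possible choice; a
real parser might prefix custom property names instead.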


As an example, I have a set of files that have some information relating to
product types. In those files, there is a custom property called
productType, which can contain values like fax, printer, monitor etc.

What I want to do is index those files so that I can search on the
product type. For instance, if I search for "canon +productType:printer",
I'll get only the documents that have to do with Canon printers. I already
have a query filter in place that can do that; it's just a matter of
getting the productType custom property into the index.
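
To make the intended behaviour concrete, here is a minimal sketch of what
such a field-restricted search amounts to once productType is available as
an indexed field. It is plain Java rather than Nutch or Lucene code, and
the Doc class and search method are hypothetical names used only for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ProductTypeSearch {

    // Hypothetical indexed document: body text plus metadata fields.
    static final class Doc {
        final String name;
        final String text;
        final Map<String, String> fields;

        Doc(String name, String text, Map<String, String> fields) {
            this.name = name;
            this.text = text;
            this.fields = fields;
        }
    }

    // Returns the names of documents whose text contains the term and whose
    // metadata field equals the requested value (both case-insensitive).
    static List<String> search(List<Doc> docs, String term, String field, String value) {
        List<String> hits = new ArrayList<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (Doc doc : docs) {
            boolean textMatches = doc.text.toLowerCase(Locale.ROOT).contains(needle);
            boolean fieldMatches = value.equalsIgnoreCase(doc.fields.get(field));
            if (textMatches && fieldMatches) {
                hits.add(doc.name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("a.doc", "Canon PIXMA setup guide",
                    Collections.singletonMap("productType", "printer")),
            new Doc("b.doc", "Canon fax manual",
                    Collections.singletonMap("productType", "fax")),
            new Doc("c.doc", "Canon brochure",
                    Collections.<String, String>emptyMap()));

        // prints [a.doc]
        System.out.println(search(docs, "canon", "productType", "printer"));
    }
}
```

A document without the productType field (c.doc above) never matches,
which is the situation all of the Word files are in until the custom
property reaches the index.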

Does the POI parser that you wrote have the ability to parse custom
properties from Microsoft Word documents?

Thank you for your reply.

Cheers



Antony Bowesman wrote:
> 
> Nutch 0.9 already extracts the properties in MSExtractor.java and
> MSBaseParser puts them into the MetaData class.
> 
> I'm not using Nutch in its entirety, only the parsing framework, but I am
> indexing the document properties quite happily from MS documents.  I also
> wrote a new parser for Office 2007, using POI 3.5 and that is also getting
> the properties in a similar way.  Is the problem at a higher level in that
> Nutch is not indexing the MetaData?
> 
> Antony
> 
> 
> 
> 
> Doğacan Güney wrote:
>> On Fri, Jan 30, 2009 at 9:15 PM, ahammad <[email protected]> wrote:
>>> Hello,
>>>
>>> I've been looking further into this and it seems like the only way to do
>>> it is to modify the msword parser so that it reads in the custom
>>> properties information. I'm attempting this but so far, I wasn't
>>> successful.
>>>
>>> The classes that I found that may be useful are
>>> org.apache.poi.hpsf.DocumentSummaryInformation and
>>> org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
>>> that I need.
>>>
>>> I'm currently trying to modify MSExtractor.java and MSBaseParser.java in
>>> the lib-parsems plugin. Am I proceeding correctly with this or am I just
>>> wasting my time?
>>>
>>> Anybody has any other suggestions? This seems like it'll be a lot of
>>> work with a very small chance of success. Any alternative methods would
>>> be nice.
>>>
>> 
>> No, you are doing the right thing. Alternatively, if you know of a
>> good java library for extracting the information you are looking for;
>> you can write your own parse-ms plugin as well.
>> 
>> Extract any metadata you want and put them in parse data metadata. You
>> can then read them during indexing and add them to your index.
>> 
>>> Thanks a lot.
>>>
>>> Cheers
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Indexing-msword-document-properties-tp21715700p21832075.html
Sent from the Nutch - User mailing list archive at Nabble.com.
