Seems like my previous message never went through.

The Nutch msword parser does index _some_ metadata. The properties shown
under the Summary tab of File>Properties in Microsoft Word (author, company
etc.) are indexed. However, you can also add custom properties to any Word
document (File>Properties, under the Custom tab), and that metadata is not
indexed.

As an example, I have a set of files that have some information relating to
product types. In those files, there is a custom property called
productType, which can contain values like fax, printer, monitor etc.

What I want is to index those files so that I can search on the product
type. For instance, if I enter "canon +productType:printer", I should get
only the documents that have to do with Canon printers. I already have a
query filter in place that can do that; it's just a matter of getting the
productType custom property into the index.

The POI parser that you wrote, does it have the ability to parse custom
properties from Microsoft Word documents?

It didn't, but I just added it - it was trivial. I'm using POI 3.5 and my parser does something like this:

    byte[] raw = content.getContent();
    POITextExtractor extractor = ExtractorFactory.createExtractor(new ByteArrayInputStream(raw));
    text = extractor.getText();
    if (POIOLE2TextExtractor.class.isAssignableFrom(extractor.getClass()))
    {
        properties = getOLE2MetaData((POIOLE2TextExtractor)extractor);
    }
    else if (POIXMLTextExtractor.class.isAssignableFrom(extractor.getClass()))
    {
        properties = getXMLMetaData((POIXMLTextExtractor)extractor);
    }

I just tried getting custom properties from the OLE2 text extractor, which is based on the MSExtractor implementation:

    private Properties getOLE2MetaData(POIOLE2TextExtractor extractor)
    {
        Properties props = new Properties();
        SummaryInformation si = extractor.getSummaryInformation();
...
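        DocumentSummaryInformation dsi = extractor.getDocSummaryInformation();
        // Guard against documents without custom properties (getCustomProperties() may return null)
        CustomProperties cp = (dsi != null) ? dsi.getCustomProperties() : null;
        if (cp != null)
        {
            Iterator i = cp.keySet().iterator();
            while (i.hasNext())
            {
                String name = (String)i.next();
                setProperty(props, name, cp.get(name).toString());
            }
        }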
        return props;
    }

This works nicely. I didn't try the XML variant, but I guess that would be pretty similar.
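
The setProperty helper called in getOLE2MetaData isn't shown. As a rough sketch only, assuming it does nothing more than skip null or blank values before storing them, it could look something like the following. The class name MetaDataUtil and the trimming behaviour are my own placeholders for illustration, not what the actual parser does.

```java
import java.util.Properties;

public class MetaDataUtil {
    // Hypothetical helper (an assumption, not the parser's real code):
    // stores the value only when the name is set and the value is non-blank.
    public static void setProperty(Properties props, String name, String value) {
        if (name == null || value == null) {
            return;
        }
        String text = value.trim();
        if (text.length() > 0) {
            props.setProperty(name, text);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        setProperty(props, "productType", " printer ");
        setProperty(props, "emptyProp", "   ");   // blank, skipped
        setProperty(props, "missingProp", null);  // null, skipped
        System.out.println(props);                // prints {productType=printer}
    }
}
```

Skipping blank values this way would keep empty fields such as an unset productType out of the index, so a query like +productType:printer only matches documents where the property was actually filled in.
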
Antony



