Hi folks, I was looking at the Lucene FAQ and found this entry very interesting: "How can I index OpenOffice.org files?"
These files (.sxw, .sxc, etc.) are ZIP archives that contain XML files. Uncompress the file using Java's ZIP support, then parse meta.xml to get the title etc. and content.xml to get the document's content. Add these to the Lucene index, typically using one Lucene field per property. Note that this applies to OpenOffice.org 1.x; things have changed a bit for OpenOffice.org 2.x, but the basic approach is still the same.

You can also use the LIUS framework for indexing OpenOffice documents (http://www.bibl.ulaval.ca/lius/). LIUS allows metadata and fulltext indexing, using XPath.

The problem is that I was not able to find more information at http://www.bibl.ulaval.ca/lius/. Has anyone had better luck finding more information on using LIUS? Also, please suggest any alternatives if LIUS is no longer available. We also have PDF, MS documents, etc. in the pipeline that need to be indexed.

Thanks much,
-DD
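
P.S. For reference, the approach the FAQ describes (unzip, then parse meta.xml and content.xml) might look roughly like the sketch below. It uses only the JDK's ZIP and DOM classes. The class name OpenOfficeTextExtractor, the method extract, and the map keys "title" and "content" are my own illustrative choices, not anything from Lucene or LIUS, and I have only reasoned about the OpenOffice.org 1.x layout described in the FAQ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls a title and the body text out of an
// OpenOffice.org archive so they can be indexed as separate fields.
final class OpenOfficeTextExtractor {

    // Returns a map with "title" (from meta.xml, if present) and
    // "content" (all text nodes of content.xml, space-separated).
    public static Map<String, String> extract(InputStream archive) throws Exception {
        Map<String, String> fields = new HashMap<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("meta.xml")) {
                    Document meta = parse(readAll(zip));
                    // "*" matches any namespace, so the dc: prefix is not hard-coded.
                    NodeList titles = meta.getElementsByTagNameNS("*", "title");
                    if (titles.getLength() > 0) {
                        fields.put("title", titles.item(0).getTextContent().trim());
                    }
                } else if (entry.getName().equals("content.xml")) {
                    StringBuilder text = new StringBuilder();
                    collectText(parse(readAll(zip)).getDocumentElement(), text);
                    fields.put("content", text.toString());
                }
            }
        }
        return fields;
    }

    // Copies the current ZIP entry into memory, so the XML parser
    // never gets a chance to close the underlying ZIP stream.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    private static Document parse(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The XML may reference a DTD that is not on disk; answer any
        // such lookup with an empty document instead of failing.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new ByteArrayInputStream(xml));
    }

    // Appends every non-blank text node, separated by single spaces.
    // Simplification: a word split across inline markup gets a space inserted.
    private static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }
}
```

Each map entry could then go into its own Lucene field, as the FAQ suggests. I left the Lucene calls out because the Field constructors differ between Lucene versions.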