Looking to Index Various Document Types.

DURGA DEEP Wed, 12 Mar 2008 13:25:36 -0700

Hi folks,

I was looking at the Lucene FAQ and found this entry very interesting:

How can I index OpenOffice.org files?


These files (.sxw, .sxc, etc) are ZIP archives that contain XML files.
Uncompress the file using Java's ZIP support, then parse meta.xml to get
title etc. and content.xml to get the document's content. Add these to the
Lucene index, typically using one Lucene field per property.

Note that this applies to OpenOffice.org 1.x. Things have changed a bit for
OpenOffice.org 2.x, but the basic approach is still the same.
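To make the FAQ's steps concrete, here is a rough sketch of the unzip-and-parse part using only the Java standard library. The class name OpenOfficeExtractor, the field names "title" and "content", and the way text is joined are my own assumptions for illustration, not anything taken from the FAQ or from LIUS. It matches elements by local name only, so it does not depend on the exact namespaces used by a given OpenOffice.org version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls a title and the body text out of an
// OpenOffice.org archive so they can be added to an index as fields.
final class OpenOfficeExtractor {

    // Returns a map of field name -> text ("title" and "content" are assumed names).
    public static Map<String, String> extract(InputStream archive) throws Exception {
        Map<String, String> fields = new HashMap<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("meta.xml")) {
                    Document doc = parse(readEntry(zip));
                    // The title element is namespaced (dc:title); match on local name.
                    NodeList titles = doc.getElementsByTagNameNS("*", "title");
                    if (titles.getLength() > 0) {
                        fields.put("title", titles.item(0).getTextContent().trim());
                    }
                } else if (name.equals("content.xml")) {
                    Document doc = parse(readEntry(zip));
                    StringBuilder text = new StringBuilder();
                    collectText(doc.getDocumentElement(), text);
                    fields.put("content", text.toString());
                }
            }
        }
        return fields;
    }

    // Copies the current ZIP entry into memory so the XML parser
    // cannot close the underlying ZipInputStream.
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    private static Document parse(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The XML files may reference a DTD that is not available locally;
        // resolve every external entity to an empty document instead of failing.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new ByteArrayInputStream(xml));
    }

    // Appends every non-blank text node below 'node', separated by single spaces.
    private static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }
}
```

Each entry of the returned map would then become one Lucene field on a document, as the FAQ suggests. Note that this simple text walk also inserts a space at inline formatting boundaries, so a word that is partly bold or italic may be split in two; a real indexer would probably want to join text per paragraph instead.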

You can also use the LIUS framework for indexing OpenOffice
<http://wiki.apache.org/lucene-java/OpenOffice> documents
(http://www.bibl.ulaval.ca/lius/).
LIUS allows metadata and full-text indexing using XPath.

The problem is that I was not able to find more information at
http://www.bibl.ulaval.ca/lius/
Has anyone had better luck finding more information on using LIUS?
Also, please suggest any alternatives if LIUS is no longer available.
We have PDF and MS Office documents, among others, in the pipeline
that need to be indexed.

Thanks much,
-DD
