Date: 2004-12-30T13:19:03
Editor: DanielNaber
Wiki: Jakarta Lucene Wiki
Page: LuceneFAQ
URL: http://wiki.apache.org/jakarta-lucene/LuceneFAQ

no comment

Change Log:

------------------------------------------------------------------------------
@@ -445,6 +445,17 @@
 See article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing, and searching XML with Digester and Lucene].

+==== How can I index OpenOffice.org files? ====
+
+These files (.sxw, .sxc, etc) are ZIP archives that contain XML files. Uncompress
+the file using Java's ZIP support, then parse meta.xml to get title etc.
+and content.xml to get the document's content. Add these to the Lucene index,
+typically using one Lucene field per property.
+
+Note that this applies to OpenOffice.org 1.x, things might change a bit for OpenOffice.org
+2.x, but the basic approach will still be the same.
+
+
 ==== How can I index MS-Word documents? ====

 In order to index Word documents you need to first parse them to extract text that you want to index from them. Here are some Word parsers that can help you with that:
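
Returning to the OpenOffice.org entry added in this change: the unzip-and-parse steps it describes could look roughly like the sketch below. It is not part of the FAQ. The class name `OpenOfficeExtractor`, the field names `title` and `content`, and the choice of the `dc:title` element in meta.xml are assumptions made for illustration. The sketch uses only the JDK's ZIP and SAX support and returns a map with one entry per property, leaving the Lucene step to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: pulls title and body text out of an OpenOffice.org 1.x file.
class OpenOfficeExtractor {

    /** Returns a map of field name to text, one entry per extracted property. */
    static Map<String, String> extract(InputStream in) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("meta.xml")) {
                    // Assumption: the title is stored in a dc:title element.
                    fields.put("title", parse(readEntry(zip), "dc:title"));
                } else if (entry.getName().equals("content.xml")) {
                    // null means: collect the text of every element.
                    fields.put("content", parse(readEntry(zip), null));
                }
            }
        }
        return fields;
    }

    // Copies the current ZIP entry into memory so the XML parser cannot close the archive stream.
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Collects character data, either from one named element or from the whole document.
    private static String parse(byte[] xml, final String onlyElement) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inside = (onlyElement == null);

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals(onlyElement)) inside = true;
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals(onlyElement)) inside = false;
                else if (onlyElement == null) text.append(' '); // keep paragraphs apart
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inside) text.append(ch, start, length);
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // The files may reference a DTD that is not available locally; skip it.
                return new InputSource(new StringReader(""));
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(new ByteArrayInputStream(xml), handler);
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

A caller would open the .sxw or .sxc file as a `FileInputStream`, pass it to `extract`, and add each map entry to a Lucene document as its own field. The exact field-construction call depends on the Lucene version in use, so it is left out here.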