We have been looking into the StarOffice/OpenOffice problem, although we haven't implemented anything and probably won't anytime soon, as we have to move on to other things. I see two approaches, both with variants:
(1) Use the fact that the format is just zipped XML: open the file with a ZipInputStream and parse the XML contents. You can use a standard XML parser or tweak your own. It might be worthwhile to index only the content part for the body, be smart about the metadata, and ignore the layout bits. This should be relatively easy, since all these parts are stored in separate files.
(2) Use the UDK (http://udk.openoffice.org/). Drawbacks: even though it seems to be documented, its sheer size will make it a bit hard to get into. You will also have to deploy a large library which is not pure Java. Advantages: you will not only get SO/OOo documents parsed as well as the programs parse them themselves, but also everything they can import. That will be way better than anything we have been able to get so far for Word documents. A UDK-based document parser would most likely be the killer application for enterprise document indexing; you wouldn't need much more, if anything at all.
We might actually still go for (1), since that is really easy, but we don't have the time for (2). We'd love to have it though, so if you go for it, tell us :-)
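To make approach (1) a bit more concrete, here is a minimal sketch rather than tested production code. It assumes the zipped XML flavour of the format, with the document body in a zip entry named content.xml; older binary files would not open this way. The class name ZippedXmlTextExtractor and the method extractBodyText are made up for illustration. It uses only the JDK's ZipInputStream and SAX parser, skips every other entry (layout, metadata), and returns the body's character data as one string that you could hand to your indexer.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: extracts the plain text of the body from a zipped XML document.
public class ZippedXmlTextExtractor {

    // Assumption: the body is stored in an entry with this name.
    private static final String CONTENT_ENTRY = "content.xml";

    /** Returns the character data of the content entry, or "" if there is no such entry. */
    public static String extractBodyText(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!CONTENT_ENTRY.equals(entry.getName())) {
                    continue; // ignore layout, metadata and everything else
                }
                final StringBuilder text = new StringBuilder();
                DefaultHandler handler = new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        text.append(' '); // keep words of adjacent elements apart
                    }

                    @Override
                    public InputSource resolveEntity(String publicId, String systemId) {
                        // The DTD is not shipped inside the zip, so resolve it to nothing.
                        return new InputSource(new StringReader(""));
                    }
                };
                // The stream reports end-of-file at the end of the current entry.
                parser.parse(zip, handler);
                return text.toString().replaceAll("\\s+", " ").trim();
            }
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(extractBodyText(in));
        }
    }
}
```

The metadata could be handled the same way with a second handler for its own entry, mapping selected elements to separate index fields instead of concatenating everything.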
HTH, Peter
Oscar Herrera wrote:
Hi. I'd like to know if any of you could help me find a Spanish analyzer (free if possible). I'd also like to know how I can index a file made with StarOffice 5.x (.sdw and .sdx files). I've been searching Google for this but have not found anything about it.
Thank you in advance for your collaboration,
Oscar Herrera Bogotá, Colombia, SA.
