Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a popular question on this list...
The traditional approach seems to be to try to find some kind of format specific reader to properly extract the textual part of such documents for indexing. The drawback of such an approach is that its complicated and cumborsome: many different formats, not that many Java libraries to understand them all.
An alternative to such a mess could be perhaps to convert those multitude of formats into something more or less standard and then extract the text from that. But again, this doesn't seem to be such a straightforward proposition. For example, one could image "printing" every document to PDF and then convert the resulting PDF to text. Not a piece of cake in Java.
Finally, a while back, somebody on this list mentioned quiet a different approach: simply read the raw binary document and go fishing for what looks like text. I would like to try that :)
Does anyone remember this proposal? Has anyone tried such an approach?
Thanks for any pointers.
Cheers,
PA.
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
