Hi there,

Just to let you know, I have implemented a plugin for the Nutch project that can parse 182 file formats, including M$ Office.
It simply uses OpenOffice and its available Java API.


It is really straightforward to use.

You can find some information and a link to the open source code here:
http://sourceforge.net/tracker/index.php?func=detail&aid=828517&group_id=59548&atid=491356

Feel free to recycle the code and give me any feedback.
Hope it will help to free some information from some strange commercial formats, since information should be free. ;)


Cheers
Stefan







Ben Litchfield wrote:

Unfortunately, it is not quite so easy.  I am not sure about Word
documents but PDFs usually have their contents compressed so a raw
"fishing" around for text would be pointless.  Your best bet is to use a
package like the one from textmining.org that handles various formats for
you.

Ben


On Thu, 30 Oct 2003, petite_abeille wrote:




Hello,

Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a
popular question on this list...

The traditional approach seems to be to try to find some kind of format
specific reader to properly extract the textual part of such documents
for indexing. The drawback of such an approach is that it's complicated
and cumbersome: many different formats, not that many Java libraries to
understand them all.

An alternative to such a mess could be perhaps to convert that
multitude of formats into something more or less standard and then
extract the text from that. But again, this doesn't seem to be such a
straightforward proposition. For example, one could imagine "printing"
every document to PDF and then convert the resulting PDF to text. Not a
piece of cake in Java.

Finally, a while back, somebody on this list mentioned quite a
different approach: simply read the raw binary document and go fishing
for what looks like text. I would like to try that :)
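
A minimal sketch of what such "fishing" could look like, similar in spirit to the Unix strings tool: scan the raw bytes and keep every run of printable ASCII above a minimum length. The class name RawTextFisher, the method extractRuns and the minimum run length of 4 are hypothetical choices for illustration, not something proposed on the list. It only recognizes single-byte ASCII, so it will find nothing useful in compressed streams (as most PDF content is) or in text stored in a multi-byte encoding.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class RawTextFisher {

    /** Returns every run of at least minLength printable ASCII bytes found in data. */
    public static List<String> extractRuns(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if (c >= 0x20 && c <= 0x7E) {
                // printable ASCII (space through tilde): extend the current run
                current.append((char) c);
            } else {
                // any other byte ends the run; keep it only if it is long enough
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // flush a run that reaches the end of the data
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the binary document to scan
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String run : extractRuns(data, 4)) {
            System.out.println(run);
        }
    }
}
```

The extracted runs could then be concatenated and handed to an indexer as plain text, with the caveat that binary noise which happens to look printable will be indexed too.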

Does anyone remember this proposal? Has anyone tried such an approach?

Thanks for any pointers.

Cheers,

PA.


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]













