Just to let you know, I have implemented a plugin for the Nutch project that can parse 182 file formats, including M$ Office.
I simply use OpenOffice through its available Java API.
It is really straightforward to use.
You can find some information and a link to the open source code here: http://sourceforge.net/tracker/index.php?func=detail&aid=828517&group_id=59548&atid=491356
Feel free to recycle the code and give me any feedback.
Hope it will help to free some information from some strange commercial formats, since information should be free. ;)
Cheers Stefan
Ben Litchfield wrote:
Unfortunately, it is not quite so easy. I am not sure about Word documents, but PDFs usually have their contents compressed, so a raw "fishing" around for text would be pointless. Your best bet is to use a package like the one from textmining.org that handles various formats for you.
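To illustrate the compression point, here is a minimal sketch. It assumes the content is deflate-compressed, as is common for PDF streams, and the class and method names (CompressedTextDemo, deflate, containsBytes) are made up for this example. It only shows that a word visible in the raw bytes can no longer be found verbatim once the bytes are compressed; it does not parse a real PDF.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressedTextDemo {

    // Compress the input with zlib/deflate (hypothetical helper for this demo).
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Naive byte search: true if needle occurs verbatim in haystack.
    public static boolean containsBytes(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Build a repetitive sample text so that it clearly compresses.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("information should be free. ");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(raw);
        byte[] word = "information".getBytes(StandardCharsets.US_ASCII);

        System.out.println("raw contains word:    " + containsBytes(raw, word));
        System.out.println("packed contains word: " + containsBytes(packed, word));
    }
}
```

A format-aware reader has to undo the compression first, which is why a library that understands the file structure is the safer route.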
Ben
On Thu, 30 Oct 2003, petite_abeille wrote:
Hello,
Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a popular question on this list...
The traditional approach seems to be to try to find some kind of format-specific reader to properly extract the textual part of such documents for indexing. The drawback of such an approach is that it's complicated and cumbersome: many different formats, not that many Java libraries to understand them all.
An alternative to such a mess could perhaps be to convert that multitude of formats into something more or less standard and then extract the text from that. But again, this doesn't seem to be such a straightforward proposition. For example, one could imagine "printing" every document to PDF and then converting the resulting PDF to text. Not a piece of cake in Java.
Finally, a while back, somebody on this list mentioned quite a different approach: simply read the raw binary document and go fishing for what looks like text. I would like to try that :)
Does anyone remember this proposal? Has anyone tried such an approach?
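For reference, a minimal sketch of what such "fishing" could look like, in the spirit of the Unix strings tool. The class and method names (RawTextFisher, extractRuns) and the minimum run length are assumptions for illustration, not something taken from the earlier proposal. It only picks up runs of printable ASCII bytes, so text stored in a 16-bit encoding or in compressed form would be missed.

```java
import java.util.ArrayList;
import java.util.List;

public class RawTextFisher {

    // Hypothetical helper: returns every run of at least minLength
    // consecutive printable ASCII bytes (0x20 to 0x7E) found in data.
    public static List<String> extractRuns(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if (c >= 0x20 && c <= 0x7E) {
                current.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Keep a run that reaches the end of the data.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // In-memory sample standing in for a binary document.
        byte[] data = {
            0x00, 'H', 'e', 'l', 'l', 'o', 0x01,
            'a', 'b', (byte) 0xFF,
            'W', 'o', 'r', 'l', 'd'
        };
        for (String run : extractRuns(data, 4)) {
            System.out.println(run);
        }
    }
}
```

The runs that come back could then be fed to the indexer as plain text, at the cost of also indexing whatever binary noise happens to look printable.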
Thanks for any pointers.
Cheers,
PA.
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
