Unfortunately, it is not quite so easy. I am not sure about Word documents, but PDFs usually have their contents compressed, so a raw "fishing" around for text would be pointless. Your best bet is to use a package like the one from textmining.org that handles various formats for you.
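To make the compression point concrete, here is a rough sketch of what the "fishing" approach amounts to and why it comes up empty on compressed data. It is only an illustration under assumptions: the class and method names (RawTextFishing, fish, deflate, sample) are made up for this mail, and java.util.zip.Deflater merely stands in for the Flate compression that PDF content streams commonly use. It does not parse real PDF files.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

public class RawTextFishing {

    // Collect every run of at least minLen printable ASCII bytes,
    // much like the Unix `strings` tool does.
    static List<String> fish(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b <= 0x7e) {
                current.append((char) b);
            } else {
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return runs;
    }

    // Compress the input with zlib/deflate (a stand-in for a compressed
    // PDF content stream; this is an assumption made for the demo).
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // True if any fished run contains the phrase we hope to index.
    static boolean found(List<String> runs, String phrase) {
        for (String run : runs) {
            if (run.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical sample document: one sentence repeated 20 times.
    static byte[] sample() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("Lucene can only index searchable text. ");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] raw = sample();
        byte[] packed = deflate(raw);
        // Fishing works on the uncompressed bytes...
        System.out.println("raw:        " + found(fish(raw, 4), "searchable text"));
        // ...but the same phrase is not recoverable from the compressed bytes.
        System.out.println("compressed: " + found(fish(packed, 4), "searchable text"));
    }
}
```

On uncompressed bytes the scan recovers the phrase; once the same bytes are deflated, the printable runs that remain are just noise. That is the situation you would be in with most PDFs, which is why a format-aware extractor is the safer route.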
Ben

On Thu, 30 Oct 2003, petite_abeille wrote:

> Hello,
>
> Indexing a multitude of esoteric formats (MS Office, PDF, etc.) is a
> popular question on this list...
>
> The traditional approach seems to be to try to find some kind of
> format-specific reader to properly extract the textual part of such
> documents for indexing. The drawback of such an approach is that it's
> complicated and cumbersome: many different formats, not that many Java
> libraries to understand them all.
>
> An alternative to such a mess could perhaps be to convert that
> multitude of formats into something more or less standard and then
> extract the text from that. But again, this doesn't seem to be such a
> straightforward proposition. For example, one could imagine "printing"
> every document to PDF and then converting the resulting PDF to text.
> Not a piece of cake in Java.
>
> Finally, a while back, somebody on this list mentioned quite a
> different approach: simply read the raw binary document and go fishing
> for what looks like text. I would like to try that :)
>
> Does anyone remember this proposal? Has anyone tried such an approach?
>
> Thanks for any pointers.
>
> Cheers,
>
> PA.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
