There is some ongoing work for nutch.org. Maybe we can bundle all the work together?! Nutch (open source) already has Java *.doc and *.pdf parsers as well.
Stefan
Pete Lewis wrote:
Hi Stefan
Using OpenOffice will enable you to parse 182 file formats, but it's not a pure Java solution and you still need an alternative solution for PDFs.
I'd be interested in knowing whether anyone is working on a pure Java solution that would give us a single method for handling MS Office documents / PDFs / etc.
Cheers
Pete
----- Original Message ----- From: "Stefan Groschupf" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 05, 2003 10:26 AM
Subject: Re: Index entire filesystem
I wrote to this list some days ago to announce a way to parse 182 file formats. There was a tiny bug report some days ago; I hope I can fix it.
Browse the archive to figure out more.
Cheers Stefan
Marcel Stor wrote:
Hi all,
I'm thinkin' about writing a search tool for my filesystem. I know such things exist already but programming it myself is much more fun ;-) So, I would have Lucene crawl through my filesystem and pass each file to an appropriate indexer (PDF -> PDFbox, etc.). Yes, I run a Windows system and would depend on the file extension to distinguish the file type. Is this a good idea in general? Is there a list of available indexers for the different file types? Any other comments are also welcome.
Regards, Marcel
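A minimal sketch of the extension-based dispatch described above, in plain Java with no Lucene or PDFBox calls. The class name ExtensionDispatcher and its methods are hypothetical, and each registered extractor is only a placeholder for whatever parser (PDFBox for PDFs, etc.) would turn a file into text before it is handed to the index.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical helper: walks a directory tree and hands each file to the
// text extractor registered for its extension.
public class ExtensionDispatcher {

    // Maps a lower-cased extension ("pdf", "doc", ...) to an extractor.
    private final Map<String, Function<Path, String>> extractors = new HashMap<>();

    public void register(String extension, Function<Path, String> extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the lower-cased extension, or "" if the name has none
    // (no dot, a leading dot only, or a trailing dot).
    public static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Walks the tree under root and returns the extracted text per file.
    // Files with no registered extractor are skipped.
    public Map<Path, String> crawl(Path root) throws IOException {
        Map<Path, String> extracted = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(file -> {
                Function<Path, String> extractor = extractors.get(extensionOf(file));
                if (extractor != null) {
                    extracted.put(file, extractor.apply(file));
                }
            });
        }
        return extracted;
    }
}
```

Relying on the extension alone is simple, but a renamed or mislabeled file would be routed to the wrong extractor, so a real tool would probably want each extractor to fail gracefully on input it cannot parse.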
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
