Re: Text Extractor

Michael Levy Mon, 23 Jul 2007 05:57:48 -0700

It might be worthwhile for you to review Nutch, a web search application
based on Lucene that can also search local filesystems.  It includes parsers
for several common office document formats.


http://lucene.apache.org/nutch/



On 7/10/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:

Hi,

On 7/10/07, Schuh, Stefan <[EMAIL PROTECTED]> wrote:
> I am looking for a text extractor (tool set) that could be used to get
> text data out of several file formats, such as office documents.
> The extracted text could then be indexed with Lucene.  A Java API would
> be best, but is not required. Does anyone know of such a tool set or
> project?

The Tika project [1] in the Apache Incubator is just getting
started on implementing such a generic toolkit. Unfortunately, we
haven't released anything yet.

You may also want to check out the Lius project [2], which is one of the
source codebases to be used in Tika. Another potential match is the
Aperture project [3].

[1] http://incubator.apache.org/tika/
[2] http://sourceforge.net/projects/lius/
[3] http://aperture.sourceforge.net/

BR,

Jukka Zitting
