Hi,

On 7/10/07, Schuh, Stefan <[EMAIL PROTECTED]> wrote:
> I am looking for a text extractor (tool set) that could be used to get
> text data out of several file formats, such as office documents.
> The extracted text could then be indexed with Lucene. A Java API would
> be best, but is not required. Does anyone know of such a tool set
> or project?

The Tika project [1] in the Apache Incubator is just getting started
on implementing such a generic toolkit. Unfortunately, we haven't
released anything yet.

You may also want to check out the Lius project [2], which is one of
the codebases Tika will be built on. Another potential match is the
Aperture project [3].

[1] http://incubator.apache.org/tika/
[2] http://sourceforge.net/projects/lius/
[3] http://aperture.sourceforge.net/

BR,

Jukka Zitting
