Hi,

On 7/10/07, Schuh, Stefan <[EMAIL PROTECTED]> wrote:
> I am looking for a text extractor (tool set) that could be used to get text data out of several file formats, such as office documents. The extracted text could then be indexed with Lucene. A Java API would be best, but it is not required. Does anyone know of such a tool set or project?

The Tika project [1] in the Apache Incubator is just getting started on implementing such a generic toolkit. Unfortunately, we haven't released anything yet. You may also want to check out the Lius project [2], which is one of the source codebases to be used in Tika. Another potential match is the Aperture project [3].

[1] http://incubator.apache.org/tika/
[2] http://sourceforge.net/projects/lius/
[3] http://aperture.sourceforge.net/

BR,

Jukka Zitting
