It might be worth taking a look at Nutch, a web search application built on Lucene that can also search local filesystems. It includes parsers for several common office document formats.
http://lucene.apache.org/nutch/

On 7/10/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:
Hi,

On 7/10/07, Schuh, Stefan <[EMAIL PROTECTED]> wrote:
> I am looking for a text extractor (tool set) which could be used, to get
> text data out of several file formats like office documents and so on.
> The text data (extract) could then be used to index with lucene. Best
> would be a java api, but not required. Does any one have knowledge
> of such a tool set or project?

The Tika project [1] in the Apache Incubator is currently getting started at implementing such a generic toolkit. Unfortunately we haven't yet released anything. You may also want to check out the Lius project [2] that is one of the source codebases to be used in Tika. Another potential match is the Aperture project [3].

[1] http://incubator.apache.org/tika/
[2] http://sourceforge.net/projects/lius/
[3] http://aperture.sourceforge.net/

BR,

Jukka Zitting
