Maybe Tika can also help you: http://tika.apache.org/
HTH
Michael

On 11.01.12 at 20:13, Reyna Melara wrote:
Hi, my name is Reyna Melara. I'm a PhD student from Mexico, and I have a set of 11,051,447 files with a .txt extension, but the content of each file is in fact in wiki format. I need them to be indexed, but I don't know whether I have to convert this content to flat text first. I have been reading and I found this:

"At the core of Lucene's logical architecture is the idea of a *document* containing *fields* of text. This flexibility allows Lucene's API to be independent of the file format<http://en.wikipedia.org/wiki/File_format>. Text from PDFs<http://en.wikipedia.org/wiki/Portable_Document_Format>, HTML<http://en.wikipedia.org/wiki/HTML>, Microsoft Word<http://en.wikipedia.org/wiki/Microsoft_Word>, and OpenDocument<http://en.wikipedia.org/wiki/OpenDocument> documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted."

So I guess there's no problem if I leave the files just as they are. My question is: will I get the same results and advantages from these files as I would from plain text? Will it work well?

Thanks a lot, and best regards.
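
For illustration only: if the files are indexed as they are, the markup characters and link targets are handed to the analyzer together with the real text, so some of that noise may end up in the index. Below is a minimal sketch of a pre-processing step that reduces wiki markup to flat text before indexing. It assumes MediaWiki-style syntax (the thread does not say which wiki dialect the files use), the function name `strip_wiki_markup` is hypothetical, and the patterns cover only a few common constructs; it is not a complete wiki parser and not part of Lucene or Tika.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Reduce a few common MediaWiki-style constructs to plain text.

    Assumption: MediaWiki-like syntax. Nested templates, tables and
    HTML tags are NOT handled by this sketch.
    """
    # Drop simple, non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Internal links: [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # External links: [http://example.org label] -> label.
    text = re.sub(r"\[https?://\S+\s+([^\]]+)\]", r"\1", text)
    # Headings: == Title == -> Title.
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    # Bold and italic markers (runs of two or more apostrophes).
    text = re.sub(r"'{2,}", "", text)
    # Collapse runs of spaces or tabs left behind by the removals.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


sample = (
    "== History ==\n"
    "'''Lucene''' is a [[Information retrieval|search]] library"
    "{{citation needed}} written in [[Java]]."
)
print(strip_wiki_markup(sample))
```

The cleaned string could then be stored in a text field of a Lucene document in the usual way. Whether this step is worth the effort for 11 million files depends on how much markup the files actually contain.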