Thanks to everyone who replied to my question. Best regards,
Reyna

2012/1/11 Michael Wechner <michael.wech...@wyona.com>

> Maybe Tika is also of help to you
>
> http://tika.apache.org/
>
> HTH
>
> Michael
>
> On 11.01.12 20:13, Reyna Melara wrote:
>
>> Hi, my name is Reyna Melara. I'm a PhD student from Mexico, and I have a
>> set of 11,051,447 files with a txt extension, but the content of each
>> file is in fact in wiki format. I need to index them, but I don't know
>> whether I have to convert this content to flat text first. I have been
>> reading and I have found that:
>>
>> "At the core of Lucene's logical architecture is the idea of a *document*
>> containing *fields* of text. This flexibility allows Lucene's API to be
>> independent of the file format
>> <http://en.wikipedia.org/wiki/File_format>. Text from PDFs
>> <http://en.wikipedia.org/wiki/Portable_Document_Format>, HTML
>> <http://en.wikipedia.org/wiki/HTML>, Microsoft Word
>> <http://en.wikipedia.org/wiki/Microsoft_Word>, and OpenDocument
>> <http://en.wikipedia.org/wiki/OpenDocument> documents, as well as many
>> others (except images), can all be indexed as long as their textual
>> information can be extracted."
>>
>> So, I guess there's no problem if I leave the files just as they are.
>>
>> My question is: will I get the same results and advantages if I index
>> these files as they are? Will that work well?
>>
>> Thanks a lot, best regards.