Hi Reyna, I have never used it, but there is a WikipediaTokenizer defined in the analyzer contrib:
http://lucene.apache.org/java/3_5_0/api/contrib-analyzers/org/apache/lucene/analysis/wikipedia/WikipediaTokenizer.html

You can find a test case for this tokenizer in the source code. Hopefully others will have better suggestions. I've also put a rough, untested sketch of how you might wire it up at the bottom of this mail.

Cheers,
Ivan

On Wed, Jan 11, 2012 at 11:13 AM, Reyna Melara <reynamel...@gmail.com> wrote:
> Hi, my name is Reyna Melara. I'm a PhD student from Mexico, and I have a set
> of 11,051,447 files with a .txt extension, but the content of each file is
> in fact in wiki format. I want and need them to be indexed, but I don't know
> if I have to convert this content to flat text first. I have been reading
> and I have found that:
>
> "At the core of Lucene's logical architecture is the idea of a *document*
> containing *fields* of text. This flexibility allows Lucene's API to be
> independent of the file format <http://en.wikipedia.org/wiki/File_format>.
> Text from PDFs <http://en.wikipedia.org/wiki/Portable_Document_Format>,
> HTML <http://en.wikipedia.org/wiki/HTML>,
> Microsoft Word <http://en.wikipedia.org/wiki/Microsoft_Word>, and
> OpenDocument <http://en.wikipedia.org/wiki/OpenDocument> documents, as well
> as many others (except images), can all be indexed as long as their textual
> information can be extracted."
>
> So, I guess there's no problem if I leave the files just as they are.
>
> My question would be: will I get the same results and advantages with these
> files as they are? Will that work well?
>
> Thanks a lot. Best regards.
>
>
> --
> Reyna
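P.S. Here is the sketch I mentioned. I have not run it, I only skimmed the 3.5 javadocs, so treat it as a starting point rather than working code. The class name WikiIndexer, the field names "path" and "contents", and the command-line arguments are all placeholders you would replace with whatever fits your setup.

import java.io.File;
import java.io.FileReader;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WikiIndexer {

    // A minimal Analyzer that feeds the wiki markup to the contrib
    // WikipediaTokenizer. You would probably also want stop words,
    // stemming, etc. depending on your retrieval needs.
    static class WikiAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new WikipediaTokenizer(reader);
            return new LowerCaseFilter(Version.LUCENE_35, stream);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: directory containing the .txt files, args[1]: index directory.
        Directory indexDir = FSDirectory.open(new File(args[1]));
        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_35, new WikiAnalyzer());
        IndexWriter writer = new IndexWriter(indexDir, config);

        // With 11 million files you will want to walk the directory tree
        // recursively and commit periodically; listFiles() here is only to
        // keep the example short.
        for (File file : new File(args[0]).listFiles()) {
            Document doc = new Document();
            // Store the path so search hits can be mapped back to the file.
            doc.add(new Field("path", file.getPath(),
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Tokenized by WikiAnalyzer, not stored in the index. Note that
            // FileReader uses the platform default encoding; use an
            // InputStreamReader with an explicit charset if your files are UTF-8.
            doc.add(new Field("contents", new FileReader(file)));
            writer.addDocument(doc);
        }
        writer.close();
    }
}

The point of the sketch is that you do not have to convert the wiki markup to flat text yourself: the tokenizer understands the markup, and Lucene only ever sees fields of text, exactly as the passage you quoted says.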