You might be interested in looking at ManifoldCF for getting your documents 
into Solr.  See http://incubator.apache.org/connectors for more details.
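
On the markup itself: if you index the files as they are, the wiki syntax
is analyzed along with the words, so you may prefer to reduce it to plain
text first. Below is a rough sketch using only java.util.regex. The class
name WikiTextStripper and its set of rules are assumptions made up for
illustration. It is not a full wiki parser, and it ignores nested
templates, tables and many other constructs.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: reduces common wiki markup to plain text before indexing. */
final class WikiTextStripper {

    // {{template|...}} with no nested templates inside: dropped entirely
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[target|label]] keeps the label
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[[^\\[\\]|]*\\|([^\\[\\]]*)\\]\\]");
    // [[target]] keeps the target
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\[\\]|]*)\\]\\]");
    // [http://host/path label] keeps the label
    private static final Pattern EXTERNAL_LINK =
            Pattern.compile("\\[https?://\\S+\\s+([^\\]]*)\\]");
    // == Heading == keeps the heading text
    private static final Pattern HEADING =
            Pattern.compile("(?m)^=+[ \\t]*(.*?)[ \\t]*=+[ \\t]*$");
    // ''italic'' and '''bold''' quote runs are removed
    private static final Pattern QUOTES = Pattern.compile("'{2,}");
    // <ref>, <br/> and similar tags are removed (their inner text is kept)
    private static final Pattern HTML_TAG = Pattern.compile("<[^>]+>");

    static String strip(String wiki) {
        String text = TEMPLATE.matcher(wiki).replaceAll("");
        text = PIPED_LINK.matcher(text).replaceAll("$1");
        text = PLAIN_LINK.matcher(text).replaceAll("$1");
        text = EXTERNAL_LINK.matcher(text).replaceAll("$1");
        text = HEADING.matcher(text).replaceAll("$1");
        text = QUOTES.matcher(text).replaceAll("");
        text = HTML_TAG.matcher(text).replaceAll("");
        return text.trim();
    }

    public static void main(String[] args) {
        // Small demonstration on a made-up snippet of wiki text.
        System.out.println(strip("== History ==\n'''Lucene''' is a [[search engine|search]] library."));
    }
}
```

The string returned by strip(...) is what you would then put into a text
field of a Lucene Document. I have left the field setup out because the
Field API differs between Lucene versions.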

Karl


-----Original Message-----
From: ext Reyna Melara [mailto:reynamel...@gmail.com] 
Sent: Wednesday, January 11, 2012 2:13 PM
To: java-user@lucene.apache.org
Subject: is it possible to index wiki markup files?

Hi, my name is Reyna Melara. I'm a PhD student from Mexico, and I have a set of 
11,051,447 files with a .txt extension, but the content of each file is in fact 
in wiki format. I want and need to index them, but I don't know whether I have 
to convert the content to plain text first. I have been reading and found this:

"At the core of Lucene's logical architecture is the idea of a *document*  
containing *fields* of text. This flexibility allows Lucene's API to be 
independent of the file format <http://en.wikipedia.org/wiki/File_format>.
Text from PDFs <http://en.wikipedia.org/wiki/Portable_Document_Format>,
HTML <http://en.wikipedia.org/wiki/HTML>,
Microsoft Word <http://en.wikipedia.org/wiki/Microsoft_Word>, and 
OpenDocument <http://en.wikipedia.org/wiki/OpenDocument> documents, as well as 
many others (except images), can all be indexed as long as their textual 
information can be extracted."

So I guess there is no problem if I leave the files as they are.

My question is: will I get the same results and benefits if I index these 
files as they are? Will it work well?

Thanks a lot, and best regards.


--
Reyna

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
