Indexed data comes out exactly as it was put in. Lucene works with Java Strings, so the byte encoding is irrelevant to Lucene itself. When you index your values, you must make sure that the strings/char arrays you pass in were decoded correctly from your UTF-8 input (e.g. by using a standard Java Reader with an explicit charset, new String(byte[], charset), and so on). When you later print stored fields, you must do the same in the other direction. So the general rule is: always specify the correct charset when converting between strings and bytes.

For searching: the results also depend on the Analyzer used during indexing and searching. Analyzers written for specific languages often cannot correctly handle characters from other languages. But e.g. StandardAnalyzer or WhitespaceAnalyzer do not modify the tokens in any way (if making them lowercase is not a problem).
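To illustrate the charset rule, here is a minimal sketch in plain Java with no Lucene calls. The class and method names (CharsetRoundTrip, decodeUtf8, encodeUtf8, readAllUtf8, lossyAscii) are made up for this example. It shows the conversion with an explicit charset in both directions, and how encoding with a charset that cannot represent the characters produces the "?" marks described in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class CharsetRoundTrip {

    // bytes -> String: always name the charset instead of relying on the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // String -> bytes: the same rule in the other direction (e.g. when printing stored fields).
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // The same decoding done through a standard Java Reader with an explicit charset.
    static String readAllUtf8(byte[] raw) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(raw), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What goes wrong with an unsuitable charset: characters that US-ASCII
    // cannot represent are replaced by '?' when the String is encoded.
    static String lossyAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Six Devanagari characters, written as escapes so the source file encoding does not matter.
        String original = "\u0939\u093f\u0928\u094d\u0926\u0940";
        byte[] utf8 = encodeUtf8(original);
        System.out.println(decodeUtf8(utf8).equals(original));  // round trip is lossless
        System.out.println(readAllUtf8(utf8).equals(original)); // same result via a Reader
        System.out.println(lossyAscii(original));               // only question marks
    }
}
```

If the "?" marks already appear when printing a String right before it is handed to Lucene, or only when writing the stored field to the console or a response, then the problem is most likely in one of these conversions and not in the index.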
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: KK [mailto:dioxide.softw...@gmail.com]
> Sent: Thursday, May 21, 2009 3:25 PM
> To: java-user@lucene.apache.org
> Subject: Posting unicode data to lucene not working during
> searching/retreival!
>
> How to post utf-8 unicoded data to lucene index. Do we have to specify
> something special, any sort of flag saying that we're posting unicoded
> data?
> I tried to post some utf-8 encoded data, during retrieval I'm not able to
> see those data, there are just "?" marks in all those places. Earlier I
> was using Solr and I was posting using the same method and retrieval was
> also working fine, but I don't know what is the issue with lucene, maybe
> I'm missing something. Can someone tell me what could be the issue?
> Thank you.
>
> KK,