Re: luke and chinese text

Andrzej Bialecki Thu, 22 Dec 2011 05:05:01 -0800

On 22/12/2011 13:50, Peyman Faratin wrote:

Hi


We are indexing some Chinese text (using the following OutputStreamWriter with 
UTF-8 encoding).

OutputStreamWriter outputFileWriter =
    new OutputStreamWriter(new FileOutputStream(outputFile), "utf8");

using Lucene 3.2. The analyzer is

new LimitTokenCountAnalyzer(
    new SmartChineseAnalyzer(Version.LUCENE_32, Stopwords), Integer.MAX_VALUE)

We are now trying to inspect the index in Luke 3.4.0 (with the UTF-8 option 
selected in Luke), but the text appears garbled. We see a lot of "???". According to 
http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/decoders/StringDecoder.java

the issue should be in

   public String decodeTerm(String fieldName, Object value) {
     if (value == null) {
       return "(null)";
     } else if (value instanceof BytesRef) {
       return ((BytesRef)value).utf8ToString();
     } else {
       return value.toString();
     }
   }

In this function, the value should be an instance of BytesRef, so calling 
.utf8ToString() would decode the BytesRef into a Java string. However, for some 
unknown reason, the value in our index is not a BytesRef; I also checked that it 
is not a CharsRef.

Hmm, then what is it? Just add a println(value.getClass().getName()) and see what it is.
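A minimal sketch of that check, written as a standalone helper so it can be tried outside Luke. The class name TermProbe and the method typeOf are made up for illustration and are not part of Luke or Lucene; inside decodeTerm the same one-line println on value would do.

```java
// Hypothetical helper (not part of Luke or Lucene): reports the runtime
// class of whatever object a decoder receives.
class TermProbe {

    // Returns the fully qualified class name, or "(null)" for a null value,
    // mirroring the null handling in decodeTerm.
    static String typeOf(Object value) {
        return value == null ? "(null)" : value.getClass().getName();
    }

    public static void main(String[] args) {
        // Stand-in values; in Luke this would be the 'value' argument.
        Object[] samples = { null, "abc", new StringBuilder("abc") };
        for (Object sample : samples) {
            System.out.println(typeOf(sample));
        }
    }
}
```

If the printed name turns out to be java.lang.String, the toString() branch simply returns the term as it is stored, and no further decoding happens in this method.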

So the toString() method is called on the value object, and the result is a string of "???".

I suspect that the issue could be with the display font - please select from the Settings menu a font that supports Unicode characters. The default platform font often doesn't support them, which results in '?' or other strange characters.
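One way to tell a font problem from a data problem is to look at the code points of the term rather than at how it is rendered. The sketch below is only an illustration under stated assumptions: the class GarbleCheck and its methods are hypothetical, and the ISO-8859-1 round trip is just one example of how Chinese characters can be turned into real '?' characters before they ever reach the display.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic (not part of Luke): shows the difference between
// intact Chinese text and text whose characters were already replaced by '?'.
class GarbleCheck {

    // Formats every code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Simulates a lossy conversion: ISO-8859-1 cannot represent CJK
    // characters, so getBytes substitutes '?' for each of them.
    static String lossyRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587"; // two Chinese characters
        // Intact text: CJK code points, even if the font draws them as '?'.
        System.out.println(codePoints(original));
        // Damaged text: literal U+003F, which no font change can repair.
        System.out.println(codePoints(lossyRoundTrip(original)));
    }
}
```

If a term from the index prints CJK code points such as U+4E2D, the data is fine and changing the font should help. If it prints U+003F, the characters were lost earlier, for example when the text was read or written with the wrong encoding.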



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


