Luke and Chinese text

Peyman Faratin Thu, 22 Dec 2011 04:50:56 -0800

Hi 

We are indexing some Chinese text, written out with the following 
OutputStreamWriter using UTF-8 encoding:


OutputStreamWriter outputFileWriter = new OutputStreamWriter(new FileOutputStream(outputFile), "utf8");

We are using Lucene 3.2. The analyzer is:

new LimitTokenCountAnalyzer(new SmartChineseAnalyzer(Version.LUCENE_32, Stopwords), Integer.MAX_VALUE)

We are now trying to inspect the index in Luke 3.4.0 (with the UTF-8 option 
selected in Luke), but the terms appear garbled: we see a lot of "???". 
According to 
http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/decoders/StringDecoder.java
the issue should be in:

  public String decodeTerm(String fieldName, Object value) {
    if (value == null) {
      return "(null)";
    } else if (value instanceof BytesRef) {
      return ((BytesRef)value).utf8ToString();
    } else {
      return value.toString();
    }
  }

In this function, value should be an instance of BytesRef, so that calling 
utf8ToString() decodes the UTF-8 bytes of the BytesRef into a Java String. 
However, for some unknown reason, the value for our index is not a BytesRef; 
I also checked that it is not a CharsRef. So toString() is called on the value 
object, and the result is a run of "???".
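
One common way for "???" to appear in Java is that text passes through a 
charset that cannot represent Chinese characters somewhere between indexing 
and display. The sketch below shows that effect using only the JDK. It is not 
Luke or Lucene code, the names QuestionMarkDemo and roundTrip are hypothetical, 
and whether this is what happens with our index is an assumption that still 
needs checking.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {

    // Encode text to bytes with the given charset, then decode the bytes
    // again with the same charset. Characters the charset cannot represent
    // are replaced by the charset's replacement byte during encoding.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D U+6587), written as escapes so the
        // example does not depend on the source file encoding.
        String text = "\u4e2d\u6587";

        // UTF-8 can represent the characters, so the text survives intact.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints "true"

        // US-ASCII cannot, so each character comes back as '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // prints "??"
    }
}
```

If the terms in the index really contain U+003F characters, the text was 
probably damaged before or during indexing; if they contain the original 
characters, the problem is more likely on the display side.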

BytesRef and CharsRef are classes defined by Lucene, so to debug this further 
we may need to dig into the Lucene code. We do not know the actual runtime 
type of value. If that type does not override toString(), then 
value.toString() falls back to the default java.lang.Object implementation, 
which returns the class name followed by '@' and the object's hash code in 
hexadecimal. In the Eclipse debugger I saw that the hash code is 0.
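
To find out what value really is, one option is to print its runtime class 
and, if it holds characters, the UTF-16 code units it contains. That separates 
a string that really contains '?' (U+003F) from one that holds Chinese 
characters the display cannot render. This is a minimal sketch using only the 
JDK; ValueInspector and describe are hypothetical names and not part of Luke 
or Lucene.

```java
final class ValueInspector {

    // Return the runtime class name of the object and, for character data,
    // each UTF-16 code unit in U+XXXX notation.
    static String describe(Object value) {
        if (value == null) {
            return "(null)";
        }
        StringBuilder sb = new StringBuilder(value.getClass().getName());
        if (value instanceof CharSequence) {
            CharSequence cs = (CharSequence) value;
            sb.append(':');
            for (int i = 0; i < cs.length(); i++) {
                sb.append(String.format(" U+%04X", (int) cs.charAt(i)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A string holding one Chinese character and one literal '?'.
        System.out.println(describe("\u4e2d?")); // prints "java.lang.String: U+4E2D U+003F"

        // A type that does not override toString() uses the Object default,
        // which starts with the class name followed by '@'.
        System.out.println(new Object().toString().startsWith("java.lang.Object@")); // prints "true"
    }
}
```

A call such as describe(value) could be evaluated in the Eclipse debugger at 
the point where decodeTerm reaches the final else branch.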


Any advice would be appreciated.

Thank you,

Peyman
