Hi,

We are indexing some Chinese text, using the following OutputStreamWriter with UTF-8 encoding:
    OutputStreamWriter outputFileWriter = new OutputStreamWriter(new FileOutputStream(outputFile), "utf8");

We are using Lucene 3.2. The analyzer is:

    new LimitTokenCountAnalyzer(new SmartChineseAnalyzer(Version.LUCENE_32, Stopwords), Integer.MAX_VALUE)

We are now trying to inspect the index in Luke 3.4.0 (with the UTF-8 option selected in Luke), but the terms appear garbled: we see a lot of "???".

According to http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/decoders/StringDecoder.java the issue should be in:

    public String decodeTerm(String fieldName, Object value) {
      if (value == null) {
        return "(null)";
      } else if (value instanceof BytesRef) {
        return ((BytesRef) value).utf8ToString();
      } else {
        return value.toString();
      }
    }

In this function, the value should be an instance of BytesRef, so that calling utf8ToString() decodes the BytesRef into a Java string. However, for some unknown reason, the value for our index is not a BytesRef; I also checked that it is not a CharsRef. So toString() is called on the value object, and the result is "???".

BytesRef and CharsRef are classes defined by Lucene, so to debug this further we may need to dig into the Lucene code. We do not know the actual runtime type of value. If that type does not override toString(), then value.toString() falls back to the default java.lang.Object implementation, which returns the class name followed by "@" and the hash code in hex; in the Eclipse debugger I saw that the hash code is 0.

Any advice would be appreciated.

Thank you,
Peyman
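
P.S. To narrow this down outside of Luke, a small standalone check along the following lines might help. It is only a sketch under assumptions: GarbledTermDemo, describe and roundTrip are made-up names, not part of Luke or Lucene, and it needs Java 7 or later for StandardCharsets. It shows two things: how to print the actual runtime class of an unknown value (instead of guessing from toString()), and that encoding Chinese text with a charset that cannot represent it yields literal '?' characters, which is one possible source of the "???" we see.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbledTermDemo {

    // Hypothetical helper: reports the runtime class of an unknown value
    // together with its toString() output.
    static String describe(Object value) {
        if (value == null) {
            return "(null)";
        }
        return value.getClass().getName() + ": " + value;
    }

    // Encodes the text to bytes and decodes it again with the same charset,
    // the way a writer/reader pair using that charset would.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters

        // UTF-8 can represent the text, so the round trip is lossless.
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese)); // true

        // ISO-8859-1 cannot; each unmappable character is replaced by '?'.
        System.out.println(roundTrip(chinese, StandardCharsets.ISO_8859_1)); // ??

        // For a class that does not override toString(), the output is the
        // class name, "@", and the hash code in hex, not "???".
        System.out.println(describe(new Object()));
    }
}
```

If a call like describe(value) were added temporarily inside decodeTerm, it would tell us which class the term value really is in our setup.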