On 22/12/2011 13:50, Peyman Faratin wrote:
Hi
We are indexing some Chinese text (using the following OutputStreamWriter with
UTF-8 encoding).
OutputStreamWriter outputFileWriter = new OutputStreamWriter(new
FileOutputStream(outputFile), "utf8");
We are using Lucene 3.2. The analyzer is
new LimitTokenCountAnalyzer(new
SmartChineseAnalyzer(Version.LUCENE_32,Stopwords),Integer.MAX_VALUE)
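
As a sanity check on the writer shown above, the following minimal sketch writes a short Chinese string through the same kind of OutputStreamWriter and reads the bytes back as UTF-8. The class name Utf8RoundTrip and the method roundTrips are hypothetical, the temporary file stands in for outputFile, and the code assumes Java 8 or later. It only tests the writer, not the Lucene index or Luke.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper: checks that text survives a write/read cycle as UTF-8.
final class Utf8RoundTrip {

    static boolean roundTrips(String text) {
        try {
            // Temporary file standing in for the poster's outputFile.
            File tmp = File.createTempFile("utf8-check", ".txt");
            try {
                // Same construction as in the message; "utf8" is an alias for UTF-8.
                try (Writer writer = new OutputStreamWriter(new FileOutputStream(tmp), "utf8")) {
                    writer.write(text);
                }
                // Decode the raw bytes explicitly as UTF-8 and compare with the input.
                byte[] bytes = Files.readAllBytes(tmp.toPath());
                return new String(bytes, StandardCharsets.UTF_8).equals(text);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself encoding-independent.
        System.out.println(roundTrips("\u4e2d\u6587"));
    }
}
```

If this prints true, the writer side is producing valid UTF-8, and the problem is more likely in how the terms are decoded or displayed.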
We are now trying to inspect the index in Luke 3.4.0 (with the UTF-8 option selected in
Luke), but the terms appear garbled: we see a lot of "???". According to
http://code.google.com/p/luke/source/browse/trunk/src/org/getopt/luke/decoders/StringDecoder.java
the issue should be in
public String decodeTerm(String fieldName, Object value) {
  if (value == null) {
    return "(null)";
  } else if (value instanceof BytesRef) {
    return ((BytesRef)value).utf8ToString();
  } else {
    return value.toString();
  }
}
In this function, the value should be an instance of BytesRef, so calling
utf8ToString() would decode the UTF-8 bytes of the BytesRef into a Java String.
However, for some unknown reason, the value in our index is not a BytesRef; I
also checked that it is not a CharsRef.
Hmm, then what is it? Just add a println(value.getClass().getName()) and
see what it is.
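
The println suggested above could be extended slightly to also show the code points of the string that toString() returns. This is only a sketch: TermValueDebug and describe are hypothetical names, not part of Luke, and a plain String stands in for the value read from the index. If the output shows U+003F, the '?' characters are really in the data; if it shows CJK code points such as U+4E2D, the data is intact and only the display is at fault.

```java
// Hypothetical debugging helper, not part of Luke or Lucene.
final class TermValueDebug {

    // Returns the runtime class name of value plus the code points of value.toString().
    static String describe(Object value) {
        if (value == null) {
            return "(null)";
        }
        String text = value.toString();
        StringBuilder sb = new StringBuilder(value.getClass().getName()).append(" [");
        // Step by code point so that surrogate pairs are reported as one character.
        for (int i = 0; i < text.length(); i += Character.charCount(text.codePointAt(i))) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", text.codePointAt(i)));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Intact Chinese text, written with escapes to avoid source-encoding issues.
        System.out.println(describe("\u4e2d\u6587")); // java.lang.String [U+4E2D U+6587]
        // Text that already contains literal question marks.
        System.out.println(describe("??"));           // java.lang.String [U+003F U+003F]
    }
}
```

Inside decodeTerm, the equivalent would be a temporary System.out.println(describe(value)) before the return statements.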
So the toString() method is called on the value object, and the result is a string of "???".
I suspect that the issue could be with the display font. Please select a
font that supports Unicode characters from the Settings menu. The
default platform font often doesn't support them, which results in '?'
or other strange characters.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com