Hi Ivan,

Ivan Vasilev wrote:
> But how should I understand the meaning of this: “To overcome this,
> you have to index Chinese characters as single tokens (this will
> increase recall, but decrease precision).”
>
> I understand it like this: to get more results, I have to use,
> instead of the Chinese analyzer, another analyzer that tokenizes the
> text character by character.
Yes: StandardTokenizer[1] produces single-character tokens for Chinese
ideographs and Japanese kana, so switching to it (or to an analyzer
built on it) would index Chinese text character by character.

However, AFAIK, you will then no longer be able to perform range
searches like [AG TO PQ], because the two-character terms "AG" and
"PQ" will not be present in the index.  [A TO P] should still work,
but I don't know how useful the results would be: it would match
every word containing any single ideograph in the range [A TO P], not
just words that start with one of them.  (Note that the same problem
exists with the bigram tokens produced by CJKAnalyzer.)  There is a
quick sketch of both points in the P.S. below.

By the way, what is your use case for matching a range of words?
Doesn't exposing this kind of functionality cause performance
concerns?

Steve

[1] Lucene's StandardTokenizer API doc:
<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
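
P.S. In case it's useful, here is a minimal (untested) sketch of the
tokenization difference, written against the 2.2 API.  The class
name, the field name "f", and the sample text are just placeholders:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenDump {
        // Print each token the analyzer produces for the given text.
        static void dump(Analyzer analyzer, String text) throws Exception {
            TokenStream stream =
                analyzer.tokenStream("f", new StringReader(text));
            Token token;
            while ((token = stream.next()) != null) {
                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String text = "\u4E2D\u534E\u4EBA\u6C11"; // 4 ideographs
            // StandardAnalyzer (which uses StandardTokenizer) emits one
            // token per ideograph; CJKAnalyzer emits overlapping bigrams.
            dump(new StandardAnalyzer(), text);
            dump(new CJKAnalyzer(), text);
        }
    }

For the four-ideograph sample you should see four single-character
tokens from StandardAnalyzer and three overlapping two-character
tokens from CJKAnalyzer.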
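
And here is roughly what the single-character range search looks like
at the API level.  Again just a sketch: "content" is a hypothetical
field name, and the two bounds stand in for the "A" and "P" of my
example above:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;

    public class RangeSketch {
        public static Query singleCharRange() {
            // With single-character indexing, the only terms in the
            // index are single ideographs, so this matches any document
            // containing any ideograph that sorts between the bounds,
            // wherever that ideograph occurs within a word.
            return new RangeQuery(
                new Term("content", "\u4E00"),  // lower bound ("A")
                new Term("content", "\u4E2D"),  // upper bound ("P")
                true);                          // inclusive
        }
    }

Note that RangeQuery rewrites to a BooleanQuery over every index term
in the range, which is the source of the performance concern I
mentioned above.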