Hi Ivan,

Ivan Vasilev wrote:
> But how should I understand the meaning of this: “To overcome this,
> you have to index Chinese characters as single tokens (this will
> increase recall, but decrease precision).”
>
> I understand it like this: to get more results, I have to use,
> instead of the Chinese analyzer, another analyzer that tokenizes the
> text character by character.
Yes: StandardTokenizer[1] produces single-character tokens for Chinese
ideographs and Japanese kana, so switching to it (or to an analyzer
built on it) would index Chinese text character by character.

However, AFAIK, you will then no longer be able to perform range
searches like [AG TO PQ], because the two-character terms "AG" and
"PQ" will not be present in the index.  [A TO P] should still work,
but I don't know how useful the results would be: it would match
every word containing any single ideograph in the range [A TO P], not
just words that start with one of them.  (Note that the same problem
exists with the bigram tokens produced by CJKAnalyzer.)  There is a
quick sketch of both points in the P.S. below.

By the way, what is your use case for matching a range of words?
Doesn't exposing this kind of functionality cause performance
concerns?

Steve

[1] Lucene's StandardTokenizer API doc:
<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
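
P.S. In case it's useful, here is a minimal (untested) sketch of the
tokenization difference, written against the 2.2 API.  The class
name, the field name "f", and the sample text are just placeholders:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenDump {
        // Print each token the analyzer produces for the given text.
        static void dump(Analyzer analyzer, String text) throws Exception {
            TokenStream stream =
                analyzer.tokenStream("f", new StringReader(text));
            Token token;
            while ((token = stream.next()) != null) {
                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String text = "\u4E2D\u534E\u4EBA\u6C11"; // 4 ideographs
            // StandardAnalyzer (which uses StandardTokenizer) emits one
            // token per ideograph; CJKAnalyzer emits overlapping bigrams.
            dump(new StandardAnalyzer(), text);
            dump(new CJKAnalyzer(), text);
        }
    }

For the four-ideograph sample you should see four single-character
tokens from StandardAnalyzer and three overlapping two-character
tokens from CJKAnalyzer.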
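
And here is roughly what the single-character range search looks like
at the API level.  Again just a sketch: "content" is a hypothetical
field name, and the two bounds stand in for the "A" and "P" of my
example above:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;

    public class RangeSketch {
        public static Query singleCharRange() {
            // With single-character indexing, the only terms in the
            // index are single ideographs, so this matches any document
            // containing any ideograph that sorts between the bounds,
            // wherever that ideograph occurs within a word.
            return new RangeQuery(
                new Term("content", "\u4E00"),  // lower bound ("A")
                new Term("content", "\u4E2D"),  // upper bound ("P")
                true);                          // inclusive
        }
    }

Note that RangeQuery rewrites to a BooleanQuery over every index term
in the range, which is the source of the performance concern I
mentioned above.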