Re: The best Chinese Analyzer?

Ray Tsang Mon, 08 May 2006 02:29:04 -0700

Hi Bob,

In short, I use a slightly modified ChineseAnalyzer to index chinese text.
They differ mainly in the way they tokenize the text.


StandardAnalyzer is inteded to use w/ Latin-based languages, that each
word composes of multiple characters, and each word is separated by
special markers such as a space ' ', a comma, a period, a new line...
etc.. so "C1C2C3" (space) "C4C5C6" will be tokenized into 2 terms:
"C1C2C3" and "C4C5C6"

CJKAnalyzer tokenizes Chinese text into 2-grams (from
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/CJKTokenizer.java?rev=165565&view=markup)
"C1C2C3C4" -> "C1C2" "C2C3" "C3C4"

ChineseAnalyzer tokenizes Chinese text into 1-gram
(http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java?rev=353930&view=markup)
"C1C2C3C4" -> "C1" "C2" "C2" "C3" "C3" "C4"

The most obvious result of these 3 tokenization tokenization
strategies is the search results.
Suppose you search for "C2C3", you can only find it w/
ChineseAnalyzer, but not the other 2 with the above example.

ray,


On 5/8/06, Bob Cheung <[EMAIL PROTECTED]> wrote:

I have a question for those who have used Lucene to index and search for
Chinese Characters, what is the best Analyzer for the job?

I know all these three can do the job:

1. StandardAnalyzer
2. CJKAnalyzer
3. ChineseAnalyzer

What are the difference between these 3 analyzers?

TIA.

Regards,
Bob

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: The best Chinese Analyzer?

Reply via email to