Hi Bob, In short, I use a slightly modified ChineseAnalyzer to index chinese text. They differ mainly in the way they tokenize the text.
StandardAnalyzer is inteded to use w/ Latin-based languages, that each word composes of multiple characters, and each word is separated by special markers such as a space ' ', a comma, a period, a new line... etc.. so "C1C2C3" (space) "C4C5C6" will be tokenized into 2 terms: "C1C2C3" and "C4C5C6" CJKAnalyzer tokenizes Chinese text into 2-grams (from http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/CJKTokenizer.java?rev=165565&view=markup) "C1C2C3C4" -> "C1C2" "C2C3" "C3C4" ChineseAnalyzer tokenizes Chinese text into 1-gram (http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java?rev=353930&view=markup) "C1C2C3C4" -> "C1" "C2" "C2" "C3" "C3" "C4" The most obvious result of these 3 tokenization tokenization strategies is the search results. Suppose you search for "C2C3", you can only find it w/ ChineseAnalyzer, but not the other 2 with the above example. ray, On 5/8/06, Bob Cheung <[EMAIL PROTECTED]> wrote:
I have a question for those who have used Lucene to index and search for Chinese Characters, what is the best Analyzer for the job? I know all these three can do the job: 1. StandardAnalyzer 2. CJKAnalyzer 3. ChineseAnalyzer What are the difference between these 3 analyzers? TIA. Regards, Bob --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]