otis        2004/03/02 05:56:03

  Modified:    contributions/analyzers/src/java/org/apache/lucene/analysis/cn
                        ChineseTokenizer.java
  Log:
  - Added documentation

  Revision  Changes    Path
  1.4       +18 -1     jakarta-lucene-sandbox/contributions/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java

  Index: ChineseTokenizer.java
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene-sandbox/contributions/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java,v
  retrieving revision 1.3
  retrieving revision 1.4
  diff -u -r1.3 -r1.4
  --- ChineseTokenizer.java	22 Jan 2004 20:54:47 -0000	1.3
  +++ ChineseTokenizer.java	2 Mar 2004 13:56:03 -0000	1.4
  @@ -64,6 +64,23 @@
    * Rule: A Chinese character as a single token
    * Copyright: Copyright (c) 2001
    * Company:
  + *
  + * The difference between the ChineseTokenizer and the
  + * CJKTokenizer (id=23545) is that they have different
  + * token parsing logic.
  + *
  + * For example, if a Chinese text "C1C2C3C4" is to be
  + * indexed, the tokens returned from the ChineseTokenizer
  + * are C1, C2, C3, C4, and the tokens returned from the
  + * CJKTokenizer are C1C2, C2C3, C3C4.
  + *
  + * Therefore the index created by the CJKTokenizer is much
  + * larger.
  + *
  + * The problem is that when searching for C1, C1C2, C1C3,
  + * C4C2, C1C2C3 ... the ChineseTokenizer works, but the
  + * CJKTokenizer will not.
  + *
    * @author Yiyi Sun
    * @version 1.0
    *
  @@ -149,4 +166,4 @@
           }
       }

  -}
  \ No newline at end of file
  +}
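For readers who want to see the difference described in the new javadoc directly, here is a minimal, untested sketch (not part of this commit) that prints the tokens each tokenizer produces for the same four-character input. It assumes the sandbox analyzers jar is on the classpath, that the CJKTokenizer from attachment id=23545 lives in org.apache.lucene.analysis.cjk, and it uses the TokenStream API of this era (next() returning a Token, Token.termText()); the class name TokenizerComparison is made up for the example.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.ChineseTokenizer;
import org.apache.lucene.analysis.cjk.CJKTokenizer; // assumed package for the id=23545 tokenizer

public class TokenizerComparison {

    // Drain a token stream and print each term, mirroring the C1/C2/... notation in the javadoc.
    private static void dump(String label, TokenStream stream) throws IOException {
        System.out.print(label + ":");
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.print(" [" + t.termText() + "]");
        }
        System.out.println();
        stream.close();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4E2D\u534E\u4EBA\u6C11"; // four Chinese characters, i.e. C1C2C3C4

        // ChineseTokenizer: one token per character -> C1, C2, C3, C4
        dump("ChineseTokenizer", new ChineseTokenizer(new StringReader(text)));

        // CJKTokenizer: overlapping bigrams -> C1C2, C2C3, C3C4
        dump("CJKTokenizer", new CJKTokenizer(new StringReader(text)));
    }
}

The first line of output is the per-character (unigram) stream, the second is the overlapping bigram stream, which is why the CJK index is larger and why single-character or non-adjacent combinations only match against the ChineseTokenizer output.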