It looks like my attachment was lost. It referred to org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.
I'm inlining it here:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ChineseTokenizerTest {

    public static void main(String[] args) throws IOException {
        // "我"(I) "是"(am) "中国" "人"(Chinese = people of China)
        tokenizeChineseWords("我是中国人");
        // A single supplementary character, U+2A6D6, i.e. the surrogate pair (55401 57046)
        tokenizeChineseWords("\uD869\uDED6");
    }

    private static void tokenizeChineseWords(String chineseWords) throws IOException {
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
        TokenStream tokenizer = analyzer.tokenStream(null /* field name */, new StringReader(chineseWords));
        // Fetch the term attribute once; it is updated in place on each incrementToken()
        CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

        System.out.print("Sentence: ");
        print(chineseWords);
        System.out.println();

        System.out.print("Tokens: [");
        tokenizer.reset(); // the TokenStream contract expects reset() before the first incrementToken()
        while (tokenizer.incrementToken()) {
            print(charTermAttribute);
            System.out.print(" ");
        }
        tokenizer.end();
        tokenizer.close();
        System.out.println("]");
        System.out.println();
    }

    // Prints the text followed by the decimal value of each UTF-16 char, e.g. x(120)
    private static void print(CharSequence charTermAttribute) {
        System.out.print(charTermAttribute);
        System.out.print("(");
        for (int i = 0, length = charTermAttribute.length(); i < length; i++) {
            System.out.print((int) charTermAttribute.charAt(i));
            if (i < length - 1) System.out.print(" ");
        }
        System.out.print(")");
    }
}

From: Robert Muir <rcm...@gmail.com>
To: java-user@lucene.apache.org
Date: 01/24/2013 04:31 PM
Subject: Re: Chinese analyzer

On Thu, Jan 24, 2013 at 9:25 AM, Jerome Lanneluc <jerome_lanne...@fr.ibm.com> wrote:
> Note the 2 tokens in the second sample when I would expect to have only one
> token with the (55401 57046) characters.
>
> I could not figure out if I'm doing something wrong, or if this is a bug in
> the Chinese analyzer.
>

Which analyzer specifically? There is more than one...
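To make the expectation about the (55401 57046) characters concrete, here is a minimal sketch using only the JDK, no Lucene. It checks that those two char values form one surrogate pair, that is, a single Unicode code point. The class name SurrogatePairCheck and the helper codePointCount are made up for illustration, and the sketch says nothing about why the analyzer splits the pair.

```java
public class SurrogatePairCheck {

    // Hypothetical helper: number of Unicode code points in the string,
    // as opposed to the number of UTF-16 chars returned by length().
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 55401 = 0xD869 (high surrogate), 57046 = 0xDED6 (low surrogate)
        String s = new String(new char[] {(char) 55401, (char) 57046});

        System.out.println("chars:       " + s.length());          // 2
        System.out.println("code points: " + codePointCount(s));   // 1
        System.out.println("is pair:     "
                + Character.isSurrogatePair(s.charAt(0), s.charAt(1))); // true
        System.out.println("code point:  U+"
                + Integer.toHexString(s.codePointAt(0)).toUpperCase()); // U+2A6D6
    }
}
```

Since the two chars are one code point, a tokenizer that walks the input char by char rather than code point by code point could plausibly emit them as two tokens, which would match the output described above.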