Hi, I'm using the Lucene 3.6.1 Chinese analyzer, and when tokenizing Chinese text containing CJK Unified Ideographs Extension B characters, the resulting tokens do not contain the original words. Instead, each Extension B character (which is encoded as a UTF-16 surrogate pair) seems to be split into its two surrogate halves.
In the attached example, the output is:

Sentence: 我是中国人 (25105 26159 20013 22269 20154)
Tokens: [我(25105) 是(26159) 中国(20013 22269) 人(20154)]

Sentence: ? (55401 57046)
Tokens: [?(55401) ?(57046)]

Note the two tokens in the second sample, where I would expect a single token containing the (55401 57046) surrogate pair. I could not figure out whether I'm doing something wrong or whether this is a bug in the Chinese analyzer.

Thanks,
Jerome

Unless stated otherwise above:
Compagnie IBM France
Siège Social : 17 avenue de l'Europe, 92275 Bois-Colombes Cedex
RCS Nanterre 552 118 465
Forme Sociale : S.A.S.
Capital Social : 653.242.306,20 €
SIREN/SIRET : 552 118 465 03644 - Code NAF 6202A
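For reference, here is a minimal, self-contained Java sketch (independent of Lucene) showing that the two char values 55401 and 57046 in the second sample are not two characters at all, but the high and low halves of a single UTF-16 surrogate pair encoding the supplementary code point U+2A6D6, which lies in CJK Extension B. This is why splitting them produces two unusable tokens:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        char high = 55401; // 0xD869 — high (leading) surrogate
        char low  = 57046; // 0xDED6 — low (trailing) surrogate

        // Both chars are surrogate halves, not standalone characters.
        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true

        // Combined, they encode one supplementary code point.
        int cp = Character.toCodePoint(high, low);
        System.out.println(Integer.toHexString(cp)); // 2a6d6 (U+2A6D6, CJK Ext B)

        // One code point, but two UTF-16 code units in a Java String:
        String s = new String(Character.toChars(cp));
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```

A tokenizer that iterates over the text char by char (rather than by code point, e.g. with String.codePointAt) would emit exactly the two half-tokens shown in the sample output.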
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org