org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of
Lucene, so I'm not sure if this is the right place to mention this or not. I
was doing some detailed analysis of how this tokenizer works and noticed the
following behavior (which I would classify as a bug).
If you pass the word "construccion" to the tokenizer, it returns a single
token: "construccion". That seems correct. If you pass the word
"construcción" to this tokenizer, it generates three tokens: "construcci",
"ó", and "n". This happens because the accented "ó" (U+00F3) falls in the
LATIN_1_SUPPLEMENT Unicode block, which the tokenizer does not treat as Latin
text. Splitting the word seems like a bug and violates the "does a decent job
for most European languages" statement.
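The block classification behind this is easy to verify with plain JDK calls, no Lucene required. A minimal sketch (the class name is my own, and I use the \u00F3 escape for "ó"):

```java
// Shows why CJKTokenizer splits "construcción": the unaccented 'o' is in
// BASIC_LATIN, but the accented 'ó' (U+00F3) is in LATIN_1_SUPPLEMENT, a
// block the original check does not accept, so it becomes a token boundary.
public class UnicodeBlockCheck {
    public static void main(String[] args) {
        System.out.println(Character.UnicodeBlock.of('o'));      // BASIC_LATIN
        System.out.println(Character.UnicodeBlock.of('\u00F3')); // LATIN_1_SUPPLEMENT
    }
}
```

Since the two characters land in different blocks, the tokenizer's "is this Latin text?" test flips mid-word, which is exactly where the split occurs.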
The fix seems straightforward. I replaced the following two lines (in the
CJKTokenizer class):
    if ((ub == Character.UnicodeBlock.BASIC_LATIN)
        || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
with:

    if ((ub == Character.UnicodeBlock.BASIC_LATIN)             // chars 0x00-0x7f
        || (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT)   // chars 0x80-0xff
        || (ub == Character.UnicodeBlock.LATIN_EXTENDED_A)     // chars 0x100-0x17f
        || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
Am I missing something or does this seem like a reasonable thing to want to do?