CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong
--------------------------------------------------------

                 Key: LUCENE-1490
                 URL: https://issues.apache.org/jira/browse/LUCENE-1490
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Daniel Cheng
             Fix For: 2.4

CJKTokenizer has these lines:

    if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
        /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
        int i = (int) c;
        i = i - 65248;
        c = (char) i;
    }

This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN counterparts. Only 65281-65374 (U+FF01 to U+FF5E) can be converted this way.

The fix is to apply the offset only inside that range:

    int i = (int) c;
    if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
            && i >= 65281 && i <= 65374) {
        /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
        i = i - 65248;
        c = (char) i;
    }
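
The range check described above can be tried in isolation. The following is only a minimal sketch, not the CJKTokenizer code itself: the class name FullwidthFolding, the method fold, and the constant names are hypothetical and chosen for illustration. It assumes the intended behavior is to shift only 65281-65374 down by 65248 and to leave every other character in the block untouched.

```java
final class FullwidthFolding {
    // Fullwidth ASCII variants span U+FF01 (65281) to U+FF5E (65374).
    private static final int FULLWIDTH_FIRST = 0xFF01;
    private static final int FULLWIDTH_LAST = 0xFF5E;
    // Distance between a fullwidth form and its BASIC_LATIN counterpart (65248).
    private static final int OFFSET = 0xFEE0;

    private FullwidthFolding() {}

    // Hypothetical helper: returns the BASIC_LATIN counterpart of a fullwidth
    // ASCII variant, or the character unchanged if it has no such counterpart.
    static char fold(char c) {
        boolean inBlock = Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
        if (inBlock && c >= FULLWIDTH_FIRST && c <= FULLWIDTH_LAST) {
            return (char) (c - OFFSET);
        }
        // Halfwidth katakana such as U+FF68 fall through unchanged.
        return c;
    }
}
```

With this sketch, FullwidthFolding.fold('\uFF21') yields 'A', while FullwidthFolding.fold('\uFF68') returns U+FF68 as-is instead of an unrelated character.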