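Answering my own question for future readers' sake.
Comparing Character.UnicodeBlock.of(ch) against
Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS does the job for the common
Chinese characters (U+4E00 to U+9FFF).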
Example:
import org.junit.Assert;
import org.junit.Test;

public class DetectCJK {
    @Test
    public void test1() {
        Assert.assertEquals(Character.UnicodeBlock.BASIC_LATIN,
                Character.UnicodeBlock.of('a'));
        Assert.assertEquals(Character.UnicodeBlock.HEBREW,
                Character.UnicodeBlock.of('א'));
        Assert.assertEquals("Traditional Chinese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                Character.UnicodeBlock.of('電'));
        Assert.assertEquals("Simplified Chinese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                Character.UnicodeBlock.of('电'));
        // Japanese uses the same code point as Traditional Chinese here.
        Assert.assertEquals("Japanese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                Character.UnicodeBlock.of('電'));

        // Check every code point of a longer string. The '/' separator is
        // BASIC_LATIN, so it is skipped.
        String chineseWritingStr = "漢字/汉字";
        for (int i = 0; i < chineseWritingStr.length(); ) {
            int codePoint = chineseWritingStr.codePointAt(i);
            // Advance by 1 or 2 chars so supplementary characters are handled.
            i += Character.charCount(codePoint);
            if (codePoint == '/') {
                continue;
            }
            Assert.assertEquals("Chinese: Chinese writing",
                    Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                    Character.UnicodeBlock.of(codePoint));
        }
    }
}
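
For the wildcard use case in the quoted question, here is a minimal sketch of how the check could be wrapped into a helper. It is only an illustration under a few assumptions: the class name CjkKeywords and both method names are made up for this example, "CJK" is taken to mean the Han, Hiragana, Katakana and Hangul scripts, and it uses Character.UnicodeScript (Java 7+) rather than a single UnicodeBlock so that kana, Hangul and the Han extension blocks are also covered. The rule "any CJK character exempts the keyword from the minimum length" is likewise an assumption, not something Lucene provides.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper; the names are illustrative, not part of Lucene or the JDK.
final class CjkKeywords {

    // Assumption: these four scripts are what "CJK" means for this purpose.
    private static final Set<Character.UnicodeScript> CJK_SCRIPTS = EnumSet.of(
            Character.UnicodeScript.HAN,
            Character.UnicodeScript.HIRAGANA,
            Character.UnicodeScript.KATAKANA,
            Character.UnicodeScript.HANGUL);

    private CjkKeywords() {
    }

    /** Returns true if at least one code point of the keyword is in a CJK script. */
    static boolean containsCjk(String keyword) {
        for (int i = 0; i < keyword.length(); ) {
            int codePoint = keyword.codePointAt(i);
            if (CJK_SCRIPTS.contains(Character.UnicodeScript.of(codePoint))) {
                return true;
            }
            // Advance by 1 or 2 chars so supplementary characters are handled.
            i += Character.charCount(codePoint);
        }
        return false;
    }

    /**
     * Assumed policy: a wildcard prefix needs at least minLength code points,
     * unless it contains a CJK character, in which case one is enough.
     */
    static boolean isWildcardPrefixAllowed(String prefix, int minLength) {
        if (prefix.isEmpty()) {
            return false;
        }
        if (containsCjk(prefix)) {
            return true;
        }
        return prefix.codePointCount(0, prefix.length()) >= minLength;
    }
}
```

With a minimum of 4, this would reject the prefix of "abc*", accept the prefix of "abcd*", and accept a single CJK character. The prefix is expected without the trailing '*'.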
On Fri, Feb 22, 2013 at 12:51 AM, Gili Nachum <[email protected]> wrote:
> Hello, Is there anything in the Lucene core/contrib that could help detect
> if a keyword is CJK or not?
> I was thinking that an okay heuristic might be to check whether the keyword's
> characters fall within the CJK Unicode ranges. Is there anything that does that?
>
> I'm seeing really bad performance when users query for keywords with a
> wildcard (say, "abc*"). Therefore, as a defensive measure, I plan to
> restrict wildcard queries to have a minimum of 4 characters (e.g., reject
> "abc*" allow "abcd*").
> However, for CJK keywords, I would like to make an exception, since in CJK
> just one or two letters stand for a distinct word (I'm okay that some CJK
> characters are not words, but are phonetic in nature).
>
> Thanks.
> Gili.
>