Hi Daniel,

On 09/22/2008 at 12:49 AM, Daniel Noll wrote:
> I have a question about Korean tokenisation.  Currently there
> is a rule in StandardTokenizerImpl.jflex which looks like this:
>
>     ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+

LUCENE-1126 <https://issues.apache.org/jira/browse/LUCENE-1126> changed
StandardTokenizerImpl.jflex, for trunk and the looming 2.4 release.
ALPHANUM now looks like:

    THAI       = [\u0E00-\u0E59]

    // basic word: a sequence of digits & letters
    // (includes Thai to enable ThaiAnalyzer to function)
    ALPHANUM   = ({LETTER}|{THAI}|[:digit:])+

    // From the JFlex manual: "the expression that matches everything of
    // <a> not matched by <b> is !(!<a>|<b>)"
    LETTER     = !(![:letter:]|{CJ})

In JFlex grammars, [:letter:] is translated to the set of chars for which
Character.isLetter() returns true; this includes Chinese, Japanese and
Korean characters.  Similarly, [:digit:] -> Character.isDigit().

Although the grammar looks different, the result is the same: Korean
characters are still grouped together with digits, as you noted, unlike
Chinese and Japanese characters, each of which is tokenized separately.

Korean has been treated differently from Chinese and Japanese since
LUCENE-461 <https://issues.apache.org/jira/browse/LUCENE-461>.  The
grouping of Hangul with digits was introduced in that issue.

> I'm wondering if there was some good reason why it isn't:
>
>     ALPHANUM = (({LETTER}|{DIGIT})+|{KOREAN}+)

Since LUCENE-1126 removed the separate handling for Korean, it would have
to be re-introduced in order to make a change like this.

Steve
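
P.S. In case it helps to see the character classes concretely, here is a
rough Java sketch of what the LETTER definition works out to, i.e.
!(!a|b) is the same as "a and not b".  The class name LetterCheck and the
isCj() helper are made up for illustration.  In particular, isCj() is only
a stand-in for the grammar's CJ macro: it uses three Unicode blocks as an
assumption and does not reproduce the actual ranges in the .jflex file.

```java
class LetterCheck {
    // Hypothetical stand-in for the grammar's CJ macro (assumption:
    // approximated with three Unicode blocks, not the real ranges).
    public static boolean isCj(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA;
    }

    // Mirrors LETTER = !(![:letter:]|{CJ}): a letter that is not CJ.
    public static boolean isLetterNotCj(char c) {
        return !(!Character.isLetter(c) || isCj(c));
    }

    // Mirrors one character of ALPHANUM, leaving out the THAI alternative.
    public static boolean isAlphanumChar(char c) {
        return isLetterNotCj(c) || Character.isDigit(c);
    }

    public static void main(String[] args) {
        // Latin letter, digit, Hangul syllable, Han ideograph, Hiragana
        char[] samples = { 'a', '7', '\uD55C', '\u4E2D', '\u3042' };
        for (char c : samples) {
            System.out.printf("U+%04X isLetter=%b isCj=%b alphanum=%b%n",
                (int) c, Character.isLetter(c), isCj(c), isAlphanumChar(c));
        }
    }
}
```

The point of the sketch is that a Hangul syllable passes Character.isLetter()
and is not excluded by the CJ stand-in, so it lands in the same class as
Latin letters and digits, while the Han and Hiragana samples are excluded.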