Hi Daniel,

On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> I wish the tokeniser could just use Character.isLetter and
> Character.isDigit instead of having to know all the ranges itself, since
> the JRE already has all this information.  Character.isLetter does
> return true for CJK characters though, so the ranges would still come in
> handy for determining what kind of letter they are.  I don't suppose
> JFlex has a way to do this...

Well, a quick perusal of the JFlex docs indicates that just such a
facility is available.  From
<http://jflex.de/manual.html#SECTION00053000000000000000>
(edited for brevity: '...' indicates elided material):

-----
Lexical Rules : Syntax
...
RegExp ::= RegExp '|' RegExp
         | ...
         | PredefinedClass
         | ...

PredefinedClass ::= ...
                  | '[:letter:]'
                  | '[:digit:]'
                  | ...
...
Lexical Rules : Semantics
...
[:letter:]  isLetter()
[:digit:]   isDigit()
-----

The DIGIT macro could be replaced by the predefined character class
[:digit:].

Although isLetter() (and so also [:letter:]) includes CJK characters,
there is a way to handle this - from Lexical Rules : Semantics
(<http://jflex.de/manual.html#SECTION00053000000000000000>):

-----
!a (negation) matches everything but the strings matched by a.  Use
with care: the construction of !a involves an additional, possibly
exponential NFA to DFA transformation on the NFA for a.  Note that with
negation and union you also have (by applying DeMorgan) intersection
and set difference: the intersection of a and b is !(!a|!b), the
expression that matches everything of a not matched by b is !(!a|b)
-----

Using the /!(!a|b)/ syntax to exclude CJ characters from the LETTER
macro:

    LETTER = ! ( ! [:letter:] | {CJ} )

> On Tuesday 08 January 2008 05:17:28 Steven A Rowe wrote:
> > On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> > > We discovered [in StandardTokenizer.jj] that fullwidth letters are
> > > not treated as <LETTER> and fullwidth digits are not
> > > treated as <DIGIT>.
> >
> > IMHO, this should be fixed in the JFlex version of StandardTokenizer -
> > do you have details?
>
> The following ranges are relevant here:
>
>   FF10-FF19 Fullwidth digits
>   FF21-FF3A Fullwidth Latin uppercase
>   FF41-FF5A Fullwidth Latin lowercase

Note that these are properly covered by [:digit:] and [:letter:].

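For anyone who wants to check this against their own JVM, here is a
minimal sketch.  Since the JFlex manual excerpt above maps [:letter:]
and [:digit:] to isLetter() and isDigit(), it calls the
java.lang.Character methods directly.  The class and helper names are
made up for illustration, and isCj() is only a simplified stand-in for
the {CJ} macro (three Unicode blocks), not the real range list from the
grammar.  isNonCjLetter() mirrors the !(!a|b) set-difference form.

```java
// Sketch only: class and helper names are hypothetical, and isCj() is a
// deliberately simplified stand-in for the {CJ} macro in the grammar.
class FullwidthCheck {

    /** True if every char in [lo, hi] satisfies Character.isDigit. */
    static boolean allDigits(char lo, char hi) {
        for (int c = lo; c <= hi; c++) {
            if (!Character.isDigit((char) c)) return false;
        }
        return true;
    }

    /** True if every char in [lo, hi] satisfies Character.isLetter. */
    static boolean allLetters(char lo, char hi) {
        for (int c = lo; c <= hi; c++) {
            if (!Character.isLetter((char) c)) return false;
        }
        return true;
    }

    /** Assumption: approximates {CJ} with just three Unicode blocks. */
    static boolean isCj(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    /** Same shape as LETTER = !( ![:letter:] | {CJ} ): letter minus CJ. */
    static boolean isNonCjLetter(char c) {
        return !(!Character.isLetter(c) || isCj(c));
    }

    public static void main(String[] args) {
        System.out.println("FF10-FF19 all digits:  " + allDigits('\uFF10', '\uFF19'));
        System.out.println("FF21-FF3A all letters: " + allLetters('\uFF21', '\uFF3A'));
        System.out.println("FF41-FF5A all letters: " + allLetters('\uFF41', '\uFF5A'));
        // A CJK ideograph is a letter, but the set difference excludes it.
        System.out.println("U+4E2D non-CJ letter:  " + isNonCjLetter('\u4E2D'));
        // A Hangul syllable stays in the letter class.
        System.out.println("U+D55C non-CJ letter:  " + isNonCjLetter('\uD55C'));
    }
}
```

The answers come from whatever Unicode tables the running JVM ships
with, so the same code tracks the JVM's Unicode version rather than a
range list maintained by hand.
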
> > > Line 87:
> > >     "\uffa0"-"\uffdc"
> > >
> > > The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
> > > as expected, so I'm wondering if these halfwidth Hangul "letters"
> > > should actually be in <KOREAN> instead of <LETTER>.
>
> > However, I just noticed that [U+1100-U+11FF] is included both in the
> > <LETTER> and <KOREAN> sections - not good.  I think [U+1100-U+11FF]
> > should be removed from the <LETTER> definition, and left as-is in the
> > <KOREAN> section; and [U+FFA0-U+FFDC] should be moved from <LETTER>
> > to <KOREAN>.

Since [:letter:] includes all of the Korean ranges, there's no reason
(AFAICT) to treat them separately; unlike Chinese and Japanese
characters, which are individually tokenized, the Korean characters
should participate in the same token boundary rules as all of the other
letters.

> I had a bit more of a look through the Unicode blocks and
> found some more ranges which may or may not be worth considering.

I looked at some of the differences between Unicode 3.0.0, which Java
1.4.2 supports, and Unicode 5.0, the latest version, and there are lots
of new and modified letter and digit ranges.  This stuff gets tweaked
all the time, and I don't think Lucene should be in the business of
trying to track it, or take a position on which Unicode version users'
data should conform to.

Switching to using JFlex's [:letter:] and [:digit:] predefined
character classes ties (most of) these decisions to the user's choice
of JVM version, and this seems much more reasonable to me than the
status quo.

I will create a JIRA issue and attach a patch.

Steve