RE: Fullwidth alphanumeric characters, plus a question on Korean ranges

Steven A Rowe Thu, 10 Jan 2008 11:55:42 -0800

Hi Daniel,

On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> I wish the tokeniser could just use Character.isLetter and
> Character.isDigit instead of having to know all the ranges itself, since
> the JRE already has all this information.  Character.isLetter does
> return true for CJK characters though, so the ranges would still come in
> handy for determining what kind of letter they are.  I don't suppose
> JFlex has a way to do this...

Well, a quick perusal of the JFlex docs indicates that just such a facility is 
available.  From <http://jflex.de/manual.html#SECTION00053000000000000000> 
(edited for brevity: '...' indicates elided material):

    -----
    Lexical Rules : Syntax
    ...
       RegExp       ::= RegExp '|' RegExp | ... | PredefinedClass | ...
       PredefinedClass ::= ... | '[:letter:]' | '[:digit:]' | ...
    ...
    Lexical Rules : Semantics
    ...
       [:letter:]       isLetter()
       [:digit:]        isDigit()
    -----

The DIGIT macro could be replaced by the predefined character class [:digit:].

Although isLetter() (and so also [:letter:]) includes CJK characters, there is 
a way to handle this - from Lexical Rules : Semantics 
(<http://jflex.de/manual.html#SECTION00053000000000000000>):

    -----
    !a
        (negation)

        matches everything but the strings matched by a. Use with care:
        the construction of !a involves an additional, possibly exponential
        NFA to DFA transformation on the NFA for a. Note that with negation
        and union you also have (by applying DeMorgan) intersection and set
        difference: the intersection of a and b is !(!a|!b), the expression
        that matches everything of a not matched by b is !(!a|b) 
    -----

Using the /!(!a|b)/ syntax to exclude CJ characters from the LETTER macro:

    LETTER = ! ( ! [:letter:] | {CJ} )
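
To illustrate the set-difference identity outside of JFlex, here is a rough 
per-character analogue in Java.  It is only a sketch: the class name, the 
method names, and the isCj() ranges (Hiragana, Katakana, and CJK Unified 
Ideographs only) are assumptions made for illustration, not the actual {CJ} 
macro from the grammar.

```java
// Sketch only: a per-character analogue of LETTER = !(![:letter:] | {CJ}).
final class LetterMinusCj {

    // Hypothetical stand-in for the {CJ} macro. Assumption: covers only
    // Hiragana and Katakana (0x3040-0x30FF) and CJK Unified Ideographs
    // (0x4E00-0x9FFF), which is narrower than the grammar's real ranges.
    static boolean isCj(char c) {
        return (c >= 0x3040 && c <= 0x30FF) || (c >= 0x4E00 && c <= 0x9FFF);
    }

    // The !(!a|b) form, written exactly as the identity reads.
    static boolean viaNegation(char c) {
        return !(!Character.isLetter(c) || isCj(c));
    }

    // The same set written directly as "a and not b".
    static boolean direct(char c) {
        return Character.isLetter(c) && !isCj(c);
    }

    public static void main(String[] args) {
        char[] samples = { 'a', '1', (char) 0xFF21, (char) 0x4E2D };
        for (char c : samples) {
            System.out.println(Integer.toHexString(c) + " -> " + viaNegation(c));
        }
    }
}
```

Both methods accept exactly the same characters, which is all the De Morgan 
rewrite claims; JFlex needs the negated form only because it has no direct 
difference operator on regular expressions.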

> On Tuesday 08 January 2008 05:17:28 Steven A Rowe wrote:
> > On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> > > We discovered [in StandardTokenizer.jj] that fullwidth letters are
> > > not treated as <LETTER> and fullwidth digits are not
> > > treated as <DIGIT>.
> > 
> > IMHO, this should be fixed in the JFlex version of StandardTokenizer -
> > do you have details?
> 
> The following ranges are relevant here:
> 
>   FF10-FF19  Fullwidth digits
>   FF21-FF3A  Fullwidth Latin uppercase
>   FF41-FF5A  Fullwidth Latin lowercase

Note that these are properly covered by [:digit:] and [:letter:].
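
This is easy to confirm against the JRE directly.  The snippet below is a 
minimal check, assuming nothing beyond Character.isDigit() and 
Character.isLetter(); the class and method names are made up for the example.

```java
// Sketch only: check the three quoted fullwidth ranges against the JRE.
final class FullwidthCheck {

    // True if every char in [from, to] is a digit according to the JRE.
    static boolean allDigits(int from, int to) {
        for (int c = from; c <= to; c++) {
            if (!Character.isDigit((char) c)) {
                return false;
            }
        }
        return true;
    }

    // True if every char in [from, to] is a letter according to the JRE.
    static boolean allLetters(int from, int to) {
        for (int c = from; c <= to; c++) {
            if (!Character.isLetter((char) c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("FF10-FF19 digits:  " + allDigits(0xFF10, 0xFF19));
        System.out.println("FF21-FF3A letters: " + allLetters(0xFF21, 0xFF3A));
        System.out.println("FF41-FF5A letters: " + allLetters(0xFF41, 0xFF5A));
    }
}
```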

> > > Line 87:
> > >        "\uffa0"-"\uffdc"
> > > 
> > >   The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
> > >   as expected, so I'm wondering if these halfwidth Hangul "letters"
> > >   should actually be in <KOREAN> instead of <LETTER>.
> 
> > However, I just noticed that [U+1100-U+11FF] is included both in the
> > <LETTER> and <KOREAN> sections - not good.  I think [U+1100-U+11FF]
> > should be removed from the <LETTER> definition, and left as-is in the
> > <KOREAN> section; and [U+FFA0-U+FFDC] should be moved from <LETTER>
> > to <KOREAN>.

Since [:letter:] includes all of the Korean ranges, there's no reason (AFAICT) 
to treat them separately; unlike Chinese and Japanese characters, which are 
individually tokenized, the Korean characters should participate in the same 
token boundary rules as all of the other letters.
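
To make the distinction concrete, here is a toy tokenizer in Java.  It is a 
sketch of the behaviour described above, not the StandardTokenizer grammar: 
the class name, the helper methods, and the isCj() ranges are all assumptions, 
and the real rules (apostrophes, acronyms, host names, and so on) are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: CJ characters become single-character tokens, while every
// other letter or digit (Hangul included) joins a run with its neighbours.
final class TokenSketch {

    // Hypothetical stand-in for the {CJ} macro. Assumption: only Hiragana,
    // Katakana and CJK Unified Ideographs are treated as CJ here.
    static boolean isCj(char c) {
        return (c >= 0x3040 && c <= 0x30FF) || (c >= 0x4E00 && c <= 0x9FFF);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCj(c)) {
                flush(run, tokens);               // end any pending run
                tokens.add(String.valueOf(c));    // one token per CJ character
            } else if (Character.isLetter(c) || Character.isDigit(c)) {
                run.append(c);                    // extend the current run
            } else {
                flush(run, tokens);               // anything else is a boundary
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Emit the pending run as a token, if there is one, and reset it.
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Three Hangul syllables followed by a space and "abc".
        System.out.println(tokenize("\uD55C\uAD6D\uC5B4 abc"));
        // Three CJK ideographs.
        System.out.println(tokenize("\u65E5\u672C\u8A9E"));
    }
}
```

With these assumptions, the Hangul input yields two tokens (the three 
syllables as one word, then "abc"), while the three ideographs yield three 
single-character tokens.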

> I had a bit more of a look through the Unicode blocks and
> found some more ranges which may or may not be worth considering.

I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 
supports, and Unicode 5.0, the latest version, and there are lots of new and 
modified letter and digit ranges.  This stuff gets tweaked all the time, and I 
don't think Lucene should be in the business of trying to track it, or take a 
position on which Unicode version users' data should conform to.  

Switching to using JFlex's [:letter:] and [:digit:] predefined character 
classes ties (most of) these decisions to the user's choice of JVM version, and 
this seems much more reasonable to me than the status quo.

I will create a JIRA issue and attach a patch.

Steve

