Daniel Noll sent the message below addressed to me, and CC'd to java-dev. I guess CC is not good enough for ASF's mailing list software, since I received this message, but it never showed up on the mailing list. Belatedly forwarding it to the list now. - Steve
On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> -----Original Message-----
> From: Daniel Noll [mailto:[EMAIL PROTECTED]]
> Sent: Monday, January 07, 2008 5:07 PM
> To: Steven A Rowe
> Cc: [email protected]
> Subject: Re: Fullwidth alphanumeric characters, plus a
> question on Korean ranges
>
> On Tuesday 08 January 2008 05:17:28 Steven A Rowe wrote:
> > Hi Daniel,
> >
> > I think this discussion belongs on java-dev, so I'm replying there.
> >
> > On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> > > We discovered [in StandardTokenizer.jj] that fullwidth letters are
> > > not treated as <LETTER> and fullwidth digits are not treated as
> > > <DIGIT>.
> >
> > IMHO, this should be fixed in the JFlex version of StandardTokenizer -
> > do you have details?
>
> The following ranges are relevant here:
>
>   FF10-FF19 Fullwidth digits
>   FF21-FF3A Fullwidth Latin uppercase
>   FF41-FF5A Fullwidth Latin lowercase
>
> > > Line 87:
> > >   "\uffa0"-"\uffdc"
> > >
> > > The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
> > > as expected, so I'm wondering if these halfwidth Hangul "letters"
> > > should actually be in <KOREAN> instead of <LETTER>.
> >
> > [U+FFA0-U+FFDC] is Hangul Jamo (phonetic symbols), not precomposed
> > Hangul syllables.
>
> I know.  The Unicode spec just happens to call Jamo "letters".
>
> > However, I just noticed that [U+1100-U+11FF] is included both in the
> > <LETTER> and <KOREAN> sections - not good.  I think [U+1100-U+11FF]
> > should be removed from the <LETTER> definition, and left as-is in the
> > <KOREAN> section; and [U+FFA0-U+FFDC] should be moved from <LETTER>
> > to <KOREAN>.
>
> I think so too.  I didn't notice this overlap... makes me wish the
> parser could detect character range overlaps and warn about them.
>
> I had a bit more of a look through the Unicode blocks and found some
> more ranges which may or may not be worth considering.
>
> These would seem to be worthy of going in <LETTER>:
>   2C00-2DDF (multiple blocks which appear to contain more languages)
>   A720-A7FF Latin Extended-D
>   A800-A82F Syloti Nagri
>   A840-A87F Phags-pa
>
> There are these too, but they seem obscure...
>   2460-24FF Enclosed Alphanumerics
>
> Then you have ligatures, which if you use a normalising filter later
> may resolve to perfectly normal alphabetic characters:
>   FB00-FB4F Alphabetic Presentation Forms
>   FB50-FBFF Arabic Presentation Forms
>
> Then we have some high extensions to CJK.  These are particularly
> interesting because they would be represented in UTF-16 as surrogates
> and I have no idea how to even add them to the grammar for that reason.
>   20000-2A6DF CJK Unified Ideographs Extension B
>   2F800-2FA1F CJK Compatibility Ideographs Supplement
>
> There may be more hidden in the blocks which don't seem immediately
> obvious.
>
> I wish the tokeniser could just use Character.isLetter and
> Character.isDigit instead of having to know all the ranges itself, since
> the JRE already has all this information.  Character.isLetter does
> return true for CJK characters though, so the ranges would still come in
> handy for determining what kind of letter they are.  I don't suppose
> JFlex has a way to do this...
>
> Daniel
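
For what it's worth, Daniel's closing idea (let the JRE decide what is a letter or digit, and fall back to ranges only to tell the kinds of letter apart) might look roughly like the sketch below. Everything in it is an assumption for illustration, not anything in StandardTokenizer: the class name, the Kind enum, and especially the choice of which Unicode blocks count as CJ or KOREAN. Two caveats: Character.isDigit accepts every decimal digit in Unicode, which is broader than the grammar's <DIGIT>, and the loop walks code points rather than chars so that surrogate pairs such as those for U+20000 are seen as one character.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;

class CodePointClassifier {
    /** Hypothetical token kinds, loosely named after the grammar's token names. */
    enum Kind { DIGIT, CJ, KOREAN, LETTER, OTHER }

    /** Classifies one code point; the block-to-kind mapping is an assumption. */
    static Kind classify(int cp) {
        if (Character.isDigit(cp)) return Kind.DIGIT;    // includes FF10-FF19
        if (!Character.isLetter(cp)) return Kind.OTHER;

        UnicodeBlock block = UnicodeBlock.of(cp);
        if (block == UnicodeBlock.HANGUL_JAMO                 // 1100-11FF
                || block == UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                || block == UnicodeBlock.HANGUL_SYLLABLES
                || (cp >= 0xFFA0 && cp <= 0xFFDC)) {          // halfwidth Hangul
            return Kind.KOREAN;
        }
        if (block == UnicodeBlock.HIRAGANA
                || block == UnicodeBlock.KATAKANA
                || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || block == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT
                || (cp >= 0xFF66 && cp <= 0xFF9F)) {          // halfwidth Katakana
            return Kind.CJ;
        }
        return Kind.LETTER;                                   // e.g. FF21-FF3A
    }

    /** Classifies a string code point by code point, so surrogate pairs count once. */
    static List<Kind> classifyAll(String s) {
        List<Kind> kinds = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            kinds.add(classify(cp));
            i += Character.charCount(cp);
        }
        return kinds;
    }

    public static void main(String[] args) {
        // Fullwidth 'A', fullwidth '1', halfwidth Hangul, U+20000 as a surrogate pair.
        System.out.println(classifyAll("\uFF21\uFF11\uFFA1\uD840\uDC00"));
    }
}
```

This only shows the classification step. Whether JFlex can call out to something like it from a grammar is exactly the open question in the message above.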

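
On the wish that the parser would warn about overlapping character ranges: as far as this thread goes, neither JavaCC nor JFlex does that for us, but a standalone check is small enough to sketch. The class and method names below are made up for illustration, and the ranges fed to it are only the ones quoted in this thread, not the full grammar.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RangeOverlapChecker {
    /** An inclusive code point range tagged with the token that claims it. */
    static final class Range {
        final String token;
        final int start;
        final int end;

        Range(String token, int start, int end) {
            this.token = token;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns one warning per pair of overlapping ranges owned by different tokens. */
    static List<String> findOverlaps(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((Range r) -> r.start));

        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Range a = sorted.get(i);
            // Sorted by start, so once a later range starts past a.end we can stop.
            for (int j = i + 1; j < sorted.size() && sorted.get(j).start <= a.end; j++) {
                Range b = sorted.get(j);
                if (!a.token.equals(b.token)) {
                    warnings.add(String.format("<%s> and <%s> overlap on U+%04X-U+%04X",
                            a.token, b.token, b.start, Math.min(a.end, b.end)));
                }
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        List<Range> ranges = new ArrayList<>();
        ranges.add(new Range("LETTER", 0x1100, 0x11FF)); // the duplicate noted above
        ranges.add(new Range("LETTER", 0xFFA0, 0xFFDC));
        ranges.add(new Range("KOREAN", 0x1100, 0x11FF));
        for (String warning : findOverlaps(ranges)) {
            System.out.println(warning);
        }
    }
}
```

Run over the three ranges from this thread, it should flag the [U+1100-U+11FF] duplication between <LETTER> and <KOREAN>. Extracting the ranges from the .jj or .jflex file is left out; that part depends on the grammar syntax.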