Hi Daniel,

I think this discussion belongs on java-dev, so I'm replying there.

On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> We discovered [in StandardTokenizer.jj] that fullwidth letters are
> not treated as <LETTER> and fullwidth digits are not treated as <DIGIT>.

IMHO, this should be fixed in the JFlex version of StandardTokenizer - do you 
have details?
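
For what it's worth, here's a quick JDK-only check (not Lucene code; the class 
name is just for illustration) showing that java.lang.Character already treats 
the fullwidth forms as letters and digits, so matching them as <LETTER> and 
<DIGIT> would be consistent with the rest of Unicode:

   // Standalone sanity check: fullwidth Latin letters and digits are
   // classified as letters/digits by the JDK's Unicode tables.
   public class FullwidthCheck {
     public static void main(String[] args) {
       char fullwidthA = '\uFF21';    // FULLWIDTH LATIN CAPITAL LETTER A
       char fullwidthZero = '\uFF10'; // FULLWIDTH DIGIT ZERO
       System.out.println(Character.isLetter(fullwidthA));   // true
       System.out.println(Character.isDigit(fullwidthZero)); // true
       System.out.println(Character.UnicodeBlock.of(fullwidthA));
       // prints HALFWIDTH_AND_FULLWIDTH_FORMS
     }
   }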

Concerning handling of Korean characters, some recent StandardTokenizer.jj 
history:

StandardTokenizer loses Korean characters
   http://issues.apache.org/jira/browse/LUCENE-444

StandardTokenizer splitting all of Korean words into separate characters
   http://issues.apache.org/jira/browse/LUCENE-461

CJK char list
   http://issues.apache.org/jira/browse/LUCENE-478

> [W]hile sanity checking the blocks in StandardTokenizer.jj I found
> some suspicious parts and felt it necessary to check that this is by
> design as there is no comment explaining the anomalies.
> 
> Line 87:
>        "\uffa0"-"\uffdc"
> 
>   The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
>   as expected, so I'm wondering if these halfwidth Hangul "letters"
>   should actually be in <KOREAN> instead of <LETTER>.

[U+FFA0-U+FFDC] contains halfwidth Hangul jamo (phonetic letters), not 
precomposed Hangul syllables.

The patch for LUCENE-478 modified the <LETTER> definition to include this range, 
in order to be consistent with their full-width versions ([U+1100-U+11FF]), 
which have been in the <LETTER> definition since time immemorial:

<http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?revision=149570&view=markup&pathrev=149570>

However, I just noticed that [U+1100-U+11FF] is included in both the <LETTER> 
and <KOREAN> definitions - not good.  I think [U+1100-U+11FF] should be removed 
from the <LETTER> definition, and left as-is in the <KOREAN> section; and 
[U+FFA0-U+FFDC] should be moved from <LETTER> to <KOREAN>.
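
As a quick cross-check (JDK only; the class name is just for illustration), the 
Unicode block assignments support this grouping - both ranges are Hangul jamo 
material, while the precomposed syllables live in their own block:

   // Standalone check of the Unicode block each sample character belongs to.
   public class HangulBlockCheck {
     public static void main(String[] args) {
       System.out.println(Character.UnicodeBlock.of('\u1100'));
       // HANGUL_JAMO - currently in both <LETTER> and <KOREAN>
       System.out.println(Character.UnicodeBlock.of('\uFFA1'));
       // HALFWIDTH_AND_FULLWIDTH_FORMS - halfwidth Hangul letters,
       // currently in <LETTER>
       System.out.println(Character.UnicodeBlock.of('\uAC00'));
       // HANGUL_SYLLABLES - the precomposed syllables
     }
   }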

> Line 92:
>        "\u3040"-"\u318f",
> 
>   This block appears to duplicate the ranges in the next three lines and
>   suspiciously also includes a range which belongs to <KOREAN>, making
>   me wonder what happens when a range is in two blocks.

Otis Gospodnetic expanded this range into the individually commented sub-ranges 
that follow, and must have forgotten to remove the original range on line 92:

<http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?r1=431151&r2=431152&pathrev=431152&diff_format=h>

Here are the ranges in question:

   [U+3040-U+309F] - Japanese Hiragana
   [U+30A0-U+30FF] - Japanese Katakana
   [U+3100-U+312F] - Chinese Bopomofo
   [U+3130-U+318F] - Korean Hangul Compatibility Jamo

I agree with your assessment - the range on line 92 should be removed: aside 
from the Hangul Compatibility Jamo range, which should be moved to the <KOREAN> 
section, [U+3040-U+318F] is already covered by the Hiragana, Katakana, and 
Bopomofo ranges in the <CJ> section.
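
Here is the same kind of JDK-only cross-check (class name just for 
illustration), taking one sample code point from each of those sub-ranges; it 
shows that the umbrella range on line 92 lumps a Korean block in with the <CJ> 
material:

   // One sample code point from each sub-range of [U+3040-U+318F].
   public class Line92RangeCheck {
     public static void main(String[] args) {
       System.out.println(Character.UnicodeBlock.of('\u3042')); // HIRAGANA
       System.out.println(Character.UnicodeBlock.of('\u30A2')); // KATAKANA
       System.out.println(Character.UnicodeBlock.of('\u3105')); // BOPOMOFO
       System.out.println(Character.UnicodeBlock.of('\u3131')); // HANGUL_COMPATIBILITY_JAMO
     }
   }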

Of course, since the JavaCC grammar is no longer in Lucene-Java trunk, these 
modifications should be made in StandardTokenizerImpl.jflex.

Steve
