Hi Daniel,

I think this discussion belongs on java-dev, so I'm replying there.

On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> We discovered [in StandardTokenizer.jj] that fullwidth letters are
> not treated as <LETTER> and fullwidth digits are not treated as <DIGIT>.

IMHO, this should be fixed in the JFlex version of StandardTokenizer - do you have details?

Concerning handling of Korean characters, some recent StandardTokenizer.jj history:

StandardTokenizer loses Korean characters
http://issues.apache.org/jira/browse/LUCENE-444

StandardTokenizer splitting all of Korean words into separate characters
http://issues.apache.org/jira/browse/LUCENE-461

CJK char list
http://issues.apache.org/jira/browse/LUCENE-478

> [W]hile sanity checking the blocks in StandardTokenizer.jj I found
> some suspicious parts and felt it necessary to check that this is by
> design as there is no comment explaining the anomalies.
>
> Line 87:
> "\uffa0"-"\uffdc"
>
> The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
> as expected, so I'm wondering if these halfwidth Hangul "letters"
> should actually be in <KOREAN> instead of <LETTER>.

[U+FFA0-U+FFDC] is Hangul Jamo (phonetic symbols), not precomposed Hangul syllables. The patch for LUCENE-478 modified the <LETTER> definition to include this range in order to be consistent with inclusion of their full-width versions ([U+1100-U+11FF])* in the <LETTER> definition, since time immemorial:

<http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?revision=149570&view=markup&pathrev=149570>

However, I just noticed that [U+1100-U+11FF] is included both in the <LETTER> and <KOREAN> sections - not good.

I think [U+1100-U+11FF] should be removed from the <LETTER> definition, and left as-is in the <KOREAN> section; and [U+FFA0-U+FFDC] should be moved from <LETTER> to <KOREAN>.
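
To make the double listing concrete, here is a minimal Java sketch of an overlap check between two token classes. It assumes only the ranges named in this thread, not the full <LETTER> and <KOREAN> definitions, and the names TokenClassOverlap, Range, and findOverlaps are hypothetical, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenClassOverlap {

    /** An inclusive range of code points, e.g. U+1100-U+11FF. */
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** True if the two inclusive ranges share at least one code point. */
        boolean overlaps(Range other) {
            return start <= other.end && other.start <= end;
        }

        @Override
        public String toString() {
            return String.format("[U+%04X-U+%04X]", start, end);
        }
    }

    /** Describes every pair of ranges that the two token classes share. */
    static List<String> findOverlaps(String nameA, List<Range> a,
                                     String nameB, List<Range> b) {
        List<String> found = new ArrayList<>();
        for (Range ra : a) {
            for (Range rb : b) {
                if (ra.overlaps(rb)) {
                    found.add(nameA + " " + ra + " overlaps " + nameB + " " + rb);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumption: only the ranges discussed in this thread are listed here.
        List<Range> letter = Arrays.asList(
                new Range(0x1100, 0x11FF),   // Hangul Jamo
                new Range(0xFFA0, 0xFFDC));  // halfwidth Hangul
        List<Range> korean = Arrays.asList(
                new Range(0x1100, 0x11FF));  // Hangul Jamo, listed a second time

        for (String hit : findOverlaps("<LETTER>", letter, "<KOREAN>", korean)) {
            System.out.println(hit);
        }
    }
}
```

Fed the complete range lists from the grammar, a check along these lines would also flag any other range that appears under two token classes.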

> Line 92:
> "\u3040"-"\u318f",
>
> This block appears to duplicate the ranges in the next three lines and
> suspiciously also includes a range which belongs to <KOREAN>, making
> me wonder what happens when a range is in two blocks.

Otis Gospodnetic expanded this range to include comments on the specific ranges, and must have forgotten to remove the original range on line 92:

<http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?r1=431151&r2=431152&pathrev=431152&diff_format=h>

Here are the ranges in question:

[U+3040-U+309F] - Japanese Hiragana
[U+30A0-U+30FF] - Japanese Katakana
[U+3100-U+312F] - Chinese Bopomofo
[U+3130-U+318F] - Korean Hangul Compatibility Jamo

I agree with your assessment - the range on line 92 should be removed, since with the exception of the Hangul compatibility Jamo range, which should be moved to the <KOREAN> section, [U+3040-U+318F] is already covered by the Hiragana, Katakana, and Bopomofo ranges already included in the <CJ> section.

Of course, since the JavaCC grammar is no longer in Lucene-Java trunk, these modifications should be made in StandardTokenizerImpl.jflex.

Steve
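
P.S. As a sanity check on the line 92 arithmetic, here is a minimal Java sketch that subtracts the three <CJ> ranges from [U+3040-U+318F] and compares what is left with the Hangul Compatibility Jamo range. The class and method names (Line92Coverage, range, uncovered) are hypothetical; only the code point ranges come from the list above.

```java
import java.util.BitSet;

public class Line92Coverage {

    /** Marks every code point in the inclusive range [start, end]. */
    static BitSet range(int start, int end) {
        BitSet bits = new BitSet();
        bits.set(start, end + 1); // the upper bound of BitSet.set is exclusive
        return bits;
    }

    /** Code points of the broad range that none of the given parts cover. */
    static BitSet uncovered(BitSet broad, BitSet... parts) {
        BitSet rest = (BitSet) broad.clone();
        for (BitSet part : parts) {
            rest.andNot(part);
        }
        return rest;
    }

    public static void main(String[] args) {
        BitSet line92 = range(0x3040, 0x318F);
        BitSet hiragana = range(0x3040, 0x309F);
        BitSet katakana = range(0x30A0, 0x30FF);
        BitSet bopomofo = range(0x3100, 0x312F);
        BitSet compatJamo = range(0x3130, 0x318F);

        // Whatever the three <CJ> ranges leave uncovered should be
        // exactly the Hangul Compatibility Jamo range.
        BitSet leftover = uncovered(line92, hiragana, katakana, bopomofo);
        System.out.println(leftover.equals(compatJamo)); // prints "true"
    }
}
```

So dropping the line 92 range loses nothing from <CJ>, provided the Hangul Compatibility Jamo range is listed under <KOREAN>.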