[jira] [Commented] (LUCENE-6993) Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0

Mike Drob (JIRA) Mon, 29 Feb 2016 10:34:50 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172324#comment-15172324
 ]


Mike Drob commented on LUCENE-6993:
-----------------------------------

I think I am getting to a good place here, but there are a few more issues 
where I need some additional direction:

{code}
  /**
   * Sets the scanner buffer size in chars
   */
  public final void setBufferSize(int numChars) {
    ZZ_BUFFERSIZE = numChars;
    char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
    System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
        Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
    zzBuffer = newZzBuffer;
  }
{code}
This is code that we inject directly from our jflex templates, not generated 
code. True to their promises, ZZ-prefixed items in jflex are subject to change, 
and ZZ_BUFFERSIZE has become final between the old and new versions. We could 
fix this with an additional post-processing step that removes the final 
modifier, or by changing jflex to add a new constructor or something similar. 
It looks like all of the non-test callers set the size immediately after 
construction, so a constructor parameter would probably cover them.
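
One way to picture the post-processing option is a small source rewrite run after JFlex generation. The sketch below is not Lucene's build code: the class and method names are hypothetical, and the single-line shape of the generated {{ZZ_BUFFERSIZE}} declaration is an assumption that would need checking against real JFlex output.

```java
import java.util.regex.Pattern;

/** Hypothetical post-processing helper; not part of Lucene or JFlex. */
final class ZzBufferSizeDefinalizer {

  // Assumption: the generated scanner declares the field on a single line,
  // for example "private static final int ZZ_BUFFERSIZE = 4096;".
  private static final Pattern FINAL_BUFFERSIZE = Pattern.compile(
      "(?m)^(\\s*(?:private\\s+)?(?:static\\s+)?)final\\s+(int\\s+ZZ_BUFFERSIZE\\b)");

  /** Returns the source with "final" dropped from the ZZ_BUFFERSIZE declaration only. */
  static String removeFinal(String generatedSource) {
    return FINAL_BUFFERSIZE.matcher(generatedSource).replaceAll("$1$2");
  }

  public static void main(String[] args) {
    // Made-up input lines, used only to show that other final fields are left alone.
    String before = "  private static final int ZZ_BUFFERSIZE = 4096;\n"
        + "  private static final int YYINITIAL = 0;\n";
    System.out.print(removeFinal(before));
  }
}
```

If the generated field turns out to be static as well, removing only final would leave a size shared by every scanner instance, so the step might also have to drop static. That is part of why the constructor route may be cleaner.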

bq. Also, there are several effectively-pinned character sets (for CJK and 
Thai) that are hard-coded in the grammar, and don't include any supplementary 
characters at all. If the Unicode version changes, these will need to be moved 
to use the appropriate Unicode properties instead.
Currently ClassicTokenizer has {{THAI       = \[\u0E00-\u0E59]; ALPHANUM   = 
(\{LETTER}\|\{THAI}|\[:digit:])+;}}. If I understand the Unicode spec correctly, 
with Unicode 8.0 we can remove the THAI declaration and those characters would 
still be correctly included in LETTER. But I have near-zero confidence in this. 
Alternatively, leaving it as is should be fine, because the assigned Thai 
characters have not gone outside of that range.
For CJK, we have a special callout for CJ, but K was apparently already 
included in LETTER? I don't understand the relationship between ALPHANUM, 
{{\p\{Letter\}}} and CJK.
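
The Thai question above can be checked empirically by bucketing every code point in the hard-coded range. The sketch below assumes that LETTER corresponds roughly to what {{Character.isLetter}} reports, which I have not verified against JFlex, and it uses whatever Unicode version the running JDK ships rather than 8.0 specifically. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for inspecting the hard-coded THAI range; not Lucene code. */
final class ThaiRangeSurvey {

  // Range copied from the THAI macro quoted above.
  static final int THAI_START = 0x0E00;
  static final int THAI_END = 0x0E59;

  /** Coarse bucket for a code point, based on the running JDK's Unicode tables. */
  static String bucket(int codePoint) {
    if (!Character.isDefined(codePoint)) return "unassigned";
    if (Character.isLetter(codePoint)) return "letter";
    if (Character.isDigit(codePoint)) return "digit";
    switch (Character.getType(codePoint)) {
      case Character.NON_SPACING_MARK:
      case Character.COMBINING_SPACING_MARK:
      case Character.ENCLOSING_MARK:
        return "mark";
      default:
        return "other";
    }
  }

  /** Counts how many code points of the THAI range fall into each bucket. */
  static Map<String, Integer> survey() {
    Map<String, Integer> counts = new TreeMap<>();
    for (int cp = THAI_START; cp <= THAI_END; cp++) {
      counts.merge(bucket(cp), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    survey().forEach((name, count) -> System.out.println(name + ": " + count));
  }
}
```

Any code points that land in the mark or other buckets would stop being part of ALPHANUM if THAI were removed, unless LETTER or another macro turns out to cover them. Digits should still be picked up by {{\[:digit:]}}.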

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all 
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, 
> LUCENE-6993.patch, LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the 
> list of TLDs again. Comparing our old list with a new list indicates 800+ new 
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

