[ 
https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169638#comment-15169638
 ] 

Steve Rowe commented on LUCENE-6993:
------------------------------------

{{ClassicTokenizer}} does have direct Unicode version dependencies: 
{{\[:digit:]}} and {{\[:alpha:]}} are the equivalents of {{\p\{Digit\}}} and 
{{\p\{Letter\}}}, respectively.  Right now those definitions are pinned at Unicode 
3.0, which means that characters added since Unicode 3.0 (released 15 years 
ago, in 2000) will not be properly tokenized.
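
To make the effect of a pinned Unicode version concrete, here is a minimal sketch. It is not the JFlex or Lucene code. It assumes CPython, whose {{unicodedata}} module ships a frozen Unicode 3.2.0 database ({{unicodedata.ucd_3_2_0}}) alongside the current one. That is the closest readily available stand-in for the 3.0 data discussed above. The helpers {{is_letter}} and {{is_digit}} are hypothetical names, and they approximate {{\p\{Letter\}}} and {{\p\{Digit\}}} with the general categories L* and Nd.

```python
import unicodedata

# Frozen Unicode 3.2.0 database bundled with CPython. Assumption: used here
# only as a stand-in for "an old, pinned Unicode version" (not exactly 3.0).
OLD_DB = unicodedata.ucd_3_2_0


def is_letter(ch, db=unicodedata):
    """Approximate \\p{Letter}: general category starts with 'L'."""
    return db.category(ch).startswith("L")


def is_digit(ch, db=unicodedata):
    """Approximate \\p{Digit}: general category is 'Nd' (decimal digit)."""
    return db.category(ch) == "Nd"


# Sample characters: one old letter, plus a letter and a digit from the
# Vai script, which was added in Unicode 5.1.
LATIN_A = "\u0041"     # LATIN CAPITAL LETTER A
VAI_EE = "\ua500"      # VAI SYLLABLE EE
VAI_ZERO = "\ua620"    # VAI DIGIT ZERO

if __name__ == "__main__":
    for name, ch in [("LATIN A", LATIN_A), ("VAI EE", VAI_EE)]:
        print(name, "letter in old db:", is_letter(ch, OLD_DB),
              "| letter in current db:", is_letter(ch))
    print("VAI ZERO digit in old db:", is_digit(VAI_ZERO, OLD_DB),
          "| digit in current db:", is_digit(VAI_ZERO))
```

With the old database the Vai characters are unassigned, so a grammar built from it would treat them as neither letters nor digits. The current database classifies them correctly.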

Also, there are several effectively-pinned character sets (for CJK) that are 
hard-coded in the grammar, and don't include any supplementary characters at 
all.  If the Unicode version changes, these will need to be changed to use the 
appropriate Unicode properties instead.
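
As a rough illustration of why a hard-coded set misses supplementary characters, the sketch below compares a BMP-only range check with a lookup driven by the character database. The range in {{HARD_CODED_CJK}} is a hypothetical example, not the set from the actual grammar. The name-prefix test is only a proxy for a real Unicode property lookup, chosen because the Python standard library does not expose script properties.

```python
import unicodedata

# Hypothetical hard-coded set: only the BMP "CJK Unified Ideographs" block.
# The real grammar's character sets are not reproduced here.
HARD_CODED_CJK = [(0x4E00, 0x9FFF)]


def in_hard_coded_set(ch):
    """True if the code point falls inside one of the fixed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HARD_CODED_CJK)


def is_cjk_ideograph_by_data(ch):
    """Proxy for a property lookup: ask the character database for the name."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


BMP_IDEOGRAPH = "\u4e2d"             # U+4E2D, inside the BMP block
SUPPLEMENTARY_IDEOGRAPH = "\U00020000"  # U+20000, CJK Extension B

if __name__ == "__main__":
    for ch in (BMP_IDEOGRAPH, SUPPLEMENTARY_IDEOGRAPH):
        print(f"U+{ord(ch):04X}",
              "hard-coded:", in_hard_coded_set(ch),
              "| data-driven:", is_cjk_ideograph_by_data(ch))
```

The fixed range accepts only the BMP ideograph. The data-driven check accepts both, and it keeps pace with the Unicode version of the underlying database without edits to the ranges.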

I guess I'm -0 on leaving the Unicode version as-is because of the above, but 
since this tokenizer will never be removed, it seems bad to me to keep it 
pinned to such an old Unicode version.

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all 
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, 
> LUCENE-6993.patch, LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the 
> list of TLDs again. Comparing our old list with a new list indicates 800+ new 
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)