[ 
https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154949#comment-15154949
 ] 

Mike Drob commented on LUCENE-6993:
-----------------------------------

bq. I think we need to regenerate still, because there are new 
characters/character property changes so the actual tokenizer will change (even 
if the rules stay the same: the alphabet got bigger).

OK. My current plan is to copy all existing tokenizers to std50 packages, 
update the factories to be aware of the Lucene version, update the current 
JFlex files to use Unicode 8.0, and then regenerate all of the new tokenizer 
classes.
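
To illustrate what a version-aware factory could look like, here is a minimal, self-contained sketch. It does not use Lucene's actual classes: SketchVersion, CUTOVER, and chooseImplementation are hypothetical names. The 6.0 cut-over is an assumption taken from this issue's "Fix For" field, and the returned strings are placeholders for the frozen std50 copy and the regenerated tokenizer.

```java
// Hypothetical sketch: none of these types are Lucene classes.
final class VersionAwareFactorySketch {

    /** Minimal stand-in for a release version (major.minor). */
    static final class SketchVersion {
        final int major;
        final int minor;

        SketchVersion(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        /** True if this version is the same as or newer than {@code other}. */
        boolean onOrAfter(SketchVersion other) {
            return major > other.major
                    || (major == other.major && minor >= other.minor);
        }
    }

    // Assumption: the regenerated (Unicode 8.0) tokenizers first ship in 6.0.
    static final SketchVersion CUTOVER = new SketchVersion(6, 0);

    /** Returns a label for the implementation a factory would instantiate. */
    static String chooseImplementation(SketchVersion matchVersion) {
        if (matchVersion.onOrAfter(CUTOVER)) {
            // Regenerated from grammars that target Unicode 8.0.
            return "current";
        }
        // Frozen copy of the pre-upgrade tokenizer, kept for compatibility.
        return "std50";
    }

    public static void main(String[] args) {
        System.out.println(chooseImplementation(new SketchVersion(5, 4)));
        System.out.println(chooseImplementation(new SketchVersion(6, 0)));
    }
}
```

The point of the sketch is that an index built with an older match version keeps tokenizing the way it always did, while new indexes pick up the larger alphabet.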

Some of the tokenizers have a Unicode 3.0 directive, which suggests that they 
haven't been touched in a long time. This worries me a bit, but I'll see how it 
goes.

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all 
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, 
> LUCENE-6993.patch, LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the 
> list of TLDs again. Comparing our old list with a new list indicates 800+ new 
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.



