[ 
https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153209#comment-15153209
 ] 

Steve Rowe commented on LUCENE-6993:
------------------------------------

[~mdrob], I haven't looked at your patch yet, but there is one non-rote Unicode 
upgrade item that needs to be dealt with, from LUCENE-5357's TODO list:

* Upgrade the UAX#29-based grammars to the Unicode -6.3- _8.0_ word break 
rules, in StandardTokenizerImpl.jflex and UAX29URLEmailTokenizer.jflex.

UAX#29 word break rules can (and usually do) change with each Unicode release, 
so we'll need to review the changes between 6.3 and 8.0 and see what, if 
anything, needs changing in the tokenizer grammars.  Another item from the 
LUCENE-5357 TODO list will confirm that this has been done correctly:

* Test the new scanners against the Unicode 6.3 word break test data
** \[...]
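
As a rough illustration of what testing against the word break test data 
involves: each non-comment line of the Unicode WordBreakTest.txt file lists 
hex code points separated by a DIVISION SIGN (U+00F7, break allowed) or a 
MULTIPLICATION SIGN (U+00D7, no break), followed by a "#" comment. The sketch 
below turns one such line into the segments it implies. The class and method 
names are hypothetical and this is not the code Lucene uses. Note also that 
UAX#29 segments include whitespace and punctuation runs, so comparing them 
with tokenizer output would need an extra filtering step that is not shown 
here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Lucene).
public class WordBreakTestLine {

    /**
     * Parses one line in WordBreakTest.txt format and returns the text
     * segments that lie between the marked break opportunities.
     */
    public static List<String> expectedSegments(String line) {
        // Drop the trailing "# ..." comment, if there is one.
        int hash = line.indexOf('#');
        String rule = (hash >= 0 ? line.substring(0, hash) : line).trim();

        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : rule.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank or comment-only line
            }
            if (token.equals("\u00F7")) {
                // DIVISION SIGN: a break is allowed here, so close the segment.
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
            } else if (token.equals("\u00D7")) {
                // MULTIPLICATION SIGN: no break, keep building the segment.
            } else {
                // Anything else is a hexadecimal code point.
                current.appendCodePoint(Integer.parseInt(token, 16));
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

A test harness along these lines would feed the concatenated code points to 
the scanner and check that it never splits inside one of the returned 
segments.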

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all 
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, 
> LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the 
> list of TLDs again. Comparing our old list with a new list indicates 800+ new 
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.



