[ 
https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713510#comment-16713510
 ] 

Robert Muir commented on LUCENE-8527:
-------------------------------------

It would be really nice. I don't think the tricky part is segmentation 
at all (that is, finding the breaks); the problem is instead assigning the 
proper "label" to the token (tagging it with an emoji type). 

So the code in the ICU tokenizer uses some Unicode properties to tag the 
"stuff between breaks" as the emoji token type versus something else. I looked 
at the latest JFlex, and it seems it would need those props? And it's a little 
tricky, e.g. the ordinary ASCII digit 7 is [:Emoji:] in Unicode. So that's why 
the isEmoji there is a bit crazy.
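
To illustrate why the property alone is not enough, here is a minimal sketch in 
Python. It is not the ICU tokenizer's code: the function name is_emoji_token and 
both sets are hypothetical, and the sets hold only a small hand-picked subset of 
code points rather than the full Unicode emoji data. The assumed rule is that a 
code point with the Emoji property that normally renders as text (such as a 
digit) counts as emoji only when an emoji variation selector or a keycap mark 
follows it.

```python
# Hand-picked subset of code points carrying the Unicode Emoji property.
# Illustrative only; a real implementation would load the full emoji data.
TEXT_DEFAULT = set(range(0x30, 0x3A)) | {0x23, 0x2A, 0xA9, 0xAE}  # 0-9 # * (c) (r)
EMOJI_SUBSET = TEXT_DEFAULT | set(range(0x1F600, 0x1F650))        # plus Emoticons block

VS16 = 0xFE0F    # VARIATION SELECTOR-16, requests emoji presentation
KEYCAP = 0x20E3  # COMBINING ENCLOSING KEYCAP


def is_emoji_token(token: str) -> bool:
    """Decide whether an already-segmented token should get the emoji label."""
    if not token:
        return False
    first = ord(token[0])
    if first not in EMOJI_SUBSET:
        return False
    if first in TEXT_DEFAULT:
        # Has the Emoji property but is ordinary text on its own: require
        # evidence of an emoji sequence right after the first code point.
        return len(token) > 1 and ord(token[1]) in (VS16, KEYCAP)
    return True


if __name__ == "__main__":
    for sample in ["7", "7\ufe0f\u20e3", "\U0001F600", "a"]:
        print(repr(sample), is_emoji_token(sample))
```

With this rule a bare "7" stays an ordinary token, while the keycap sequence 
U+0037 U+FE0F U+20E3 is labeled as emoji.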


> Upgrade JFlex to 1.7.0
> ----------------------
>
>                 Key: LUCENE-8527
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8527
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/build, modules/analysis
>            Reporter: Steve Rowe
>            Priority: Minor
>
> JFlex 1.7.0, supporting Unicode 9.0, was released recently: 
> [http://jflex.de/changelog.html#jflex-1.7.0].  We should upgrade.



