[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713510#comment-16713510 ]
Robert Muir commented on LUCENE-8527:
-------------------------------------

It would be really nice. I don't think the tricky part is segmentation itself (finding the breaks); the problem is assigning the proper "label" to the token (tagging it as an emoji type).

The ICU tokenizer uses some Unicode properties to tag the "stuff between breaks" as the emoji token type versus something else. I looked at the latest JFlex, and it seems it would need those properties?

And it's a little tricky: e.g., the ordinary ASCII digit 7 is [:Emoji:] in Unicode. That's why the isEmoji there is a bit crazy.

> Upgrade JFlex to 1.7.0
> ----------------------
>
>                 Key: LUCENE-8527
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8527
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/build, modules/analysis
>            Reporter: Steve Rowe
>            Priority: Minor
>
> JFlex 1.7.0, supporting Unicode 9.0, was released recently: [http://jflex.de/changelog.html#jflex-1.7.0]. We should upgrade.
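
To illustrate the labeling problem described in the comment, here is a minimal Java sketch of tagging an already-segmented token as emoji or not. It is not the ICU tokenizer's or Lucene's actual isEmoji code. The class name, the label strings, and the three-entry sample table are hypothetical, and the rule shown (a digit, '#', or '*' counts as emoji only when followed by U+FE0F or U+20E3) is just one plausible way to handle code points that are [:Emoji:] but normally plain text.

```java
import java.util.Set;

/** Hypothetical sketch of emoji token labeling; not Lucene's or ICU's implementation. */
final class EmojiLabelSketch {
  static final int VARIATION_SELECTOR_16 = 0xFE0F;      // requests emoji presentation
  static final int COMBINING_ENCLOSING_KEYCAP = 0x20E3; // turns "7" into a keycap

  // Tiny illustrative sample of code points that display as emoji by default.
  // A real tokenizer would consult the full Unicode emoji property data instead.
  static final Set<Integer> SAMPLE_EMOJI_PRESENTATION = Set.of(0x1F600, 0x1F680, 0x1F44D);

  /** '#', '*', and ASCII digits are [:Emoji:] in Unicode but are ordinary text by default. */
  static boolean isKeycapBase(int codePoint) {
    return codePoint == '#' || codePoint == '*' || (codePoint >= '0' && codePoint <= '9');
  }

  /** Returns a hypothetical label for the text between two breaks. */
  static String labelFor(String token) {
    if (token.isEmpty()) {
      return "OTHER";
    }
    int first = token.codePointAt(0);
    int next = Character.charCount(first);
    if (isKeycapBase(first)) {
      // The base alone is plain text; it only becomes an emoji with a trailer.
      if (next < token.length()) {
        int trailer = token.codePointAt(next);
        if (trailer == VARIATION_SELECTOR_16 || trailer == COMBINING_ENCLOSING_KEYCAP) {
          return "EMOJI";
        }
      }
      return "OTHER";
    }
    return SAMPLE_EMOJI_PRESENTATION.contains(first) ? "EMOJI" : "OTHER";
  }

  public static void main(String[] args) {
    String keycapSeven = "7\uFE0F\u20E3";
    String grinningFace = new String(Character.toChars(0x1F600));
    System.out.println("7 -> " + labelFor("7"));
    System.out.println("keycap 7 -> " + labelFor(keycapSeven));
    System.out.println("U+1F600 -> " + labelFor(grinningFace));
  }
}
```

The point of the sketch is that a bare [:Emoji:] property check is not enough: a plain "7" would be mislabeled unless code points like digits get special handling.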