[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-8125:
--------------------------------
    Attachment: LUCENE-8125.patch

Here's a patch. I did more cleanup of outdated breakiterator stuff while I was here. It's not needed after the ICU upgrade (LUCENE-8122).

I added some simple tests, e.g. sequences such as 👩❤️👩 (WOMAN + ZWJ + HEAVY BLACK HEART + VARIATION SELECTOR-16 + ZWJ + WOMAN) are recognized as one token because the rules already knew that. The filters we have, such as ICUNormalizer2Filter/ICUFoldingFilter, would reduce the above to WOMAN + HEAVY BLACK HEART + WOMAN, because they remove the default ignorables.

> emoji sequence support in ICUTokenizer
> --------------------------------------
>
>                 Key: LUCENE-8125
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8125
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-8125.patch
>
>
> uax29 word break rules already know how to handle these correctly, we just
> need to assign them a token type.
> This is better than users trying to do this with custom rules (e.g.
> LUCENE-7916) because they are script-independent (common/inherited).
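
For readers who want to see the code points involved, here is a minimal sketch in plain Java. It does not use Lucene or ICU and is not taken from the patch. The class name `EmojiSequenceSketch` and its helper methods are hypothetical, and the stripping step removes only ZWJ (U+200D) and VARIATION SELECTOR-16 (U+FE0F), as a stand-in for the default-ignorable removal that the normalization/folding filters are described as doing above.

```java
import java.util.ArrayList;
import java.util.List;

public class EmojiSequenceSketch {
    static final int WOMAN = 0x1F469;
    static final int ZWJ = 0x200D;               // ZERO WIDTH JOINER
    static final int HEAVY_BLACK_HEART = 0x2764;
    static final int VS16 = 0xFE0F;              // VARIATION SELECTOR-16

    // Builds WOMAN + ZWJ + HEAVY BLACK HEART + VS16 + ZWJ + WOMAN.
    static String buildSequence() {
        int[] cps = {WOMAN, ZWJ, HEAVY_BLACK_HEART, VS16, ZWJ, WOMAN};
        return new String(cps, 0, cps.length);
    }

    // Hypothetical helper: drops only the two default ignorables that occur
    // in this example. The real filters handle the full ignorable set.
    static String stripJoinersAndSelectors(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
         .filter(cp -> cp != ZWJ && cp != VS16)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Formats each code point of the string as U+XXXX.
    static List<String> codePointsHex(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    public static void main(String[] args) {
        String seq = buildSequence();
        // [U+1F469, U+200D, U+2764, U+FE0F, U+200D, U+1F469]
        System.out.println(codePointsHex(seq));
        // [U+1F469, U+2764, U+1F469]
        System.out.println(codePointsHex(stripJoinersAndSelectors(seq)));
    }
}
```

The first list is the six-code-point sequence that the tokenizer is said to keep together as one token. The second is what remains once the joiners and the variation selector are gone.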