[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414690#comment-16414690 ]

ASF subversion and git services commented on LUCENE-8125:
---------------------------------------------------------

Commit c0b92e279423dbc6852ca2f9cce681604b44d19b in lucene-solr's branch refs/heads/branch_7x from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c0b92e2 ]

LUCENE-8175: un-revert "LUCENE-8125: ICUTokenizer support for emoji/emoji sequence tokens""

This was a casualty of war because it relied on new unicode stuff

> emoji sequence support in ICUTokenizer
> --------------------------------------
>
> Key: LUCENE-8125
> URL: https://issues.apache.org/jira/browse/LUCENE-8125
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Robert Muir
> Priority: Major
> Fix For: trunk, 7.3
>
> Attachments: LUCENE-8125.patch, LUCENE-8125.patch, LUCENE-8125.patch,
> LUCENE-8125.patch, LUCENE-8125.patch
>
>
> uax29 word break rules already know how to handle these correctly, we just
> need to assign them a token type.
> This is better than users trying to do this with custom rules (e.g.
> LUCENE-7916) because they are script-independent (common/inherited).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414681#comment-16414681 ]

ASF subversion and git services commented on LUCENE-8125:
---------------------------------------------------------

Commit 23bff7dbc207083af2ccb1b308c121ac18c36508 in lucene-solr's branch refs/heads/master from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=23bff7d ]

LUCENE-8175: un-revert "LUCENE-8125: ICUTokenizer support for emoji/emoji sequence tokens""

This was a casualty of war because it relied on new unicode stuff
[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325709#comment-16325709 ]

Uwe Schindler commented on LUCENE-8125:
---------------------------------------

Thanks Robert!
[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325646#comment-16325646 ]

ASF subversion and git services commented on LUCENE-8125:
---------------------------------------------------------

Commit c9916e3048e98371f056b96cdbaa996f1f36a2fa in lucene-solr's branch refs/heads/branch_7x from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c9916e3 ]

LUCENE-8125: ICUTokenizer support for emoji/emoji sequence tokens
[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325641#comment-16325641 ]

ASF subversion and git services commented on LUCENE-8125:
---------------------------------------------------------

Commit 972df6c69de494b8a4f59e4e0d4de241d4ca6a80 in lucene-solr's branch refs/heads/master from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=972df6c ]

LUCENE-8125: ICUTokenizer support for emoji/emoji sequence tokens
[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319308#comment-16319308 ]

Michael McCandless commented on LUCENE-8125:
--------------------------------------------

+1, cool!
[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318092#comment-16318092 ]

Adrien Grand commented on LUCENE-8125:
--------------------------------------

+1
[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317274#comment-16317274 ]

Robert Muir commented on LUCENE-8125:
-------------------------------------

Note: I think it'd be nice to fix this for StandardTokenizer at some point too, but I think we first need to bring its grammar up to the latest Unicode. That way it will have the latest UAX #29 rules around this stuff, such as "Do not break within emoji zwj sequences." So there is some work to do for that, but we can tackle it here with ICU first.
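
The issue description's point is that word-break segmentation already keeps an emoji sequence together, so the only missing piece is labeling the resulting token. A minimal Python sketch can illustrate that. It is not Lucene's or ICU's code: `token_type`, the hand-picked `EMOJI_RANGES`, and the `<EMOJI>`/`<ALPHANUM>` labels are assumptions made for illustration, and the input segments are assumed to come from a UAX #29 word-break segmenter that has already applied rules such as "Do not break within emoji zwj sequences."

```python
# Illustrative sketch only; not the Lucene or ICU implementation.

ZWJ = "\u200D"  # ZERO WIDTH JOINER, glues emoji into one sequence

# Rough, hand-picked code point ranges (an assumption for this sketch).
# A real implementation would consult Unicode emoji properties instead.
EMOJI_RANGES = (
    (0x1F1E6, 0x1F1FF),  # regional indicators (flags)
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, later additions
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
)


def is_emoji_code_point(cp: int) -> bool:
    """True if the code point falls in one of the rough ranges above."""
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)


def token_type(segment: str):
    """Label one already-segmented token (hypothetical helper)."""
    if any(is_emoji_code_point(ord(ch)) for ch in segment):
        return "<EMOJI>"
    if any(ch.isalnum() for ch in segment):
        return "<ALPHANUM>"
    return None  # whitespace or punctuation: not emitted as a token


# Segments as a UAX #29 segmenter is assumed to produce them: the ZWJ
# sequence (woman + ZWJ + laptop) arrives as a single segment.
segments = ["I", " ", "\U0001F469" + ZWJ + "\U0001F4BB", " ", "lucene"]
typed = [(s, token_type(s)) for s in segments if token_type(s) is not None]
```

The check looks only at code points and never at the script of the surrounding text, which mirrors the "script-independent" argument: emoji are assigned the Common script, while the joiner and variation selectors are Inherited, so a per-script custom rule is a poor fit for them.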