[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer

2018-03-26 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414690#comment-16414690
 ] 

ASF subversion and git services commented on LUCENE-8125:
-

Commit c0b92e279423dbc6852ca2f9cce681604b44d19b in lucene-solr's branch 
refs/heads/branch_7x from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c0b92e2 ]

LUCENE-8175: un-revert "LUCENE-8125: ICUTokenizer support for emoji/emoji 
sequence tokens""

This was a casualty of war because it relied on new unicode stuff


> emoji sequence support in ICUTokenizer
> --
>
> Key: LUCENE-8125
> URL: https://issues.apache.org/jira/browse/LUCENE-8125
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Fix For: trunk, 7.3
>
> Attachments: LUCENE-8125.patch, LUCENE-8125.patch, LUCENE-8125.patch, 
> LUCENE-8125.patch, LUCENE-8125.patch
>
>
> The UAX #29 word break rules already handle these sequences correctly; we 
> just need to assign them a token type. 
> This is better than users trying to do this with custom rules (e.g. 
> LUCENE-7916) because emoji are script-independent (common/inherited).
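
To illustrate the idea in the description (segmentation is already done, only a type label is missing), here is a minimal Python sketch. It is not the ICUTokenizer implementation: the code point ranges are a hypothetical, heavily simplified stand-in for the real Unicode emoji properties, and the labels `<EMOJI>`, `<ALPHANUM>` and `<OTHER>` are used here only for illustration.

```python
# Hypothetical, simplified stand-in for the Unicode emoji properties.
# A real tokenizer would consult the Unicode data files instead.
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
    (0x1F1E6, 0x1F1FF),  # regional indicators (flags)
]


def is_emoji_code_point(cp):
    """True if the code point falls in one of the illustrative ranges."""
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)


def token_type(token):
    """Assign an illustrative type label to an already-segmented token."""
    if any(is_emoji_code_point(ord(ch)) for ch in token):
        return "<EMOJI>"
    if any(ch.isalnum() for ch in token):
        return "<ALPHANUM>"
    return "<OTHER>"


if __name__ == "__main__":
    for tok in ["lucene", "\U0001F984", "\U0001F469\u200D\U0001F4BB"]:
        print(repr(tok), token_type(tok))
```

The check never looks at the script of the surrounding text, which is the point made above: one rule covers emoji regardless of the language they appear in.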



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer

2018-03-26 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414681#comment-16414681
 ] 

ASF subversion and git services commented on LUCENE-8125:
-

Commit 23bff7dbc207083af2ccb1b308c121ac18c36508 in lucene-solr's branch 
refs/heads/master from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=23bff7d ]

LUCENE-8175: un-revert "LUCENE-8125: ICUTokenizer support for emoji/emoji 
sequence tokens""

This was a casualty of war because it relied on new unicode stuff



[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer

2018-01-14 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325709#comment-16325709
 ] 

Uwe Schindler commented on LUCENE-8125:
---

Thanks Robert! 🦄🤖


[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer

2018-01-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325646#comment-16325646
 ] 

ASF subversion and git services commented on LUCENE-8125:
-

Commit c9916e3048e98371f056b96cdbaa996f1f36a2fa in lucene-solr's branch 
refs/heads/branch_7x from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c9916e3 ]

LUCENE-8125: ICUTokenizer support for emoji/emoji sequence tokens



[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer

2018-01-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325641#comment-16325641
 ] 

ASF subversion and git services commented on LUCENE-8125:
-

Commit 972df6c69de494b8a4f59e4e0d4de241d4ca6a80 in lucene-solr's branch 
refs/heads/master from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=972df6c ]

LUCENE-8125: ICUTokenizer support for emoji/emoji sequence tokens



[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer

2018-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319308#comment-16319308
 ] 

Michael McCandless commented on LUCENE-8125:


+1, cool!

[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer

2018-01-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318092#comment-16318092
 ] 

Adrien Grand commented on LUCENE-8125:
--

+1

[jira] [Commented] (LUCENE-8125) emoji sequence support in ICUTokenizer

2018-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317274#comment-16317274
 ] 

Robert Muir commented on LUCENE-8125:
-

Note: I think it'd be nice to fix this for StandardTokenizer at some point too, 
but I think we first need to bring its grammar up to the latest Unicode version. 
That way it will have the latest UAX #29 rules around this, such as "Do not 
break within emoji zwj sequences." So there is some work to do for that, but we 
can tackle it here with ICU first.
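
As a rough illustration of what "Do not break within emoji zwj sequences" means in practice, here is a small Python sketch. It is a toy under stated assumptions, not the UAX #29 algorithm and not what ICU does: it treats every code point as its own segment and only suppresses breaks around ZERO WIDTH JOINER (U+200D), the emoji variation selector (U+FE0F) and the skin tone modifiers (U+1F3FB to U+1F3FF). The helper names `is_extender` and `segment` are hypothetical.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VS16 = "\ufe0f"  # VARIATION SELECTOR-16 (emoji presentation)


def is_extender(ch):
    """Code points that this sketch always attaches to the previous segment."""
    return ch in (ZWJ, VS16) or 0x1F3FB <= ord(ch) <= 0x1F3FF


def segment(text):
    """Split text per code point, but never break inside a ZWJ sequence.

    Simplification: anything following a ZWJ is glued to the previous
    segment, whereas the real rule only applies before pictographic
    characters.
    """
    segments = []
    for ch in text:
        if segments and (is_extender(ch) or segments[-1].endswith(ZWJ)):
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


if __name__ == "__main__":
    # WOMAN + ZWJ + PERSONAL COMPUTER stays together as one segment.
    print(segment("\U0001F469\u200D\U0001F4BB\U0001F984"))
```

A grammar that predates this rule would break after the joiner and emit the two halves of the sequence separately, which is why the StandardTokenizer grammar needs updating first.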

> emoji sequence support in ICUTokenizer
> --
>
> Key: LUCENE-8125
> URL: https://issues.apache.org/jira/browse/LUCENE-8125
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
> Attachments: LUCENE-8125.patch
>
>
> uax29 word break rules already know how to handle these correctly, we just 
> need to assign them a token type. 
> This is better than users trying to do this with custom rules (e.g. 
> LUCENE-7916) because they are script-independent (common/inherited).


