[jira] [Commented] (LUCENE-9914) Modernize Emoji regeneration scripts

Uwe Schindler (Jira) Thu, 08 Apr 2021 05:46:28 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317155#comment-17317155
 ]


Uwe Schindler commented on LUCENE-9914:
---------------------------------------

That's really fast. The reason why I did it like that was that ICU should not 
be a runtime dependency, so it just extracts the data and provides it to 
CharTokenizer as a Bits interface (backed by a sparse bitset). The script only 
takes milliseconds. 😜
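
As a rough illustration of the "extract once, look up at runtime" idea, here is 
a minimal Java sketch. The class CodePointRanges and the ranges used with it 
are hypothetical stand-ins, not the actual Lucene code or the real Emoji 
property data; the get(int) method only mirrors the shape of a Bits-style 
lookup.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a Bits-style code point lookup (not Lucene's class). */
final class CodePointRanges {
  // Sorted, non-overlapping inclusive ranges: [start0, end0, start1, end1, ...].
  private final int[] bounds;

  CodePointRanges(int... bounds) {
    if (bounds.length % 2 != 0) {
      throw new IllegalArgumentException("need start/end pairs");
    }
    this.bounds = bounds.clone();
  }

  /** Returns true if the code point falls inside one of the ranges. */
  boolean get(int codePoint) {
    int idx = Arrays.binarySearch(bounds, codePoint);
    if (idx >= 0) {
      return true; // exactly on a range boundary
    }
    int insertionPoint = -idx - 1;
    // An odd insertion point lies between a start and its matching end.
    return (insertionPoint & 1) == 1;
  }
}
```

The data would be generated ahead of time (for example from ICU) and baked 
into the source, so nothing but the JDK is needed at runtime.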

Maybe we can just extend the class UnicodeData to contain Emoji codepoints in a 
similar way and let the jflex code depend on it.

Because of my bad experience with the domain name tokenizer, I tend to think 
that the FSA should only contain a "best guess", such as Unicode ranges, so 
that the FSA stays small. The exact emoji lookup could then be done in the 
jflex callback, with everything that is not an emoji handed back to jflex as 
no match.
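
A minimal Java sketch of this two-stage idea, under stated assumptions: the 
class TwoStageEmojiCheck, the coarse range and the tiny exact set are all 
hypothetical and only illustrate the control flow, not real jflex callback code 
or real Emoji data.

```java
import java.util.Set;

/** Hypothetical two-stage check: coarse range first, exact lookup second. */
final class TwoStageEmojiCheck {
  // Stage 2 data: illustrative subset only, not the real Emoji property.
  private static final Set<Integer> EXACT = Set.of(0x1F600, 0x1F61C, 0x1F680);

  /** Stage 1: cheap "best guess" the automaton could encode as a single range. */
  static boolean looksLikeEmoji(int codePoint) {
    return codePoint >= 0x1F300 && codePoint <= 0x1FAFF;
  }

  /** Stage 2: exact lookup; a false result would be reported as "no match". */
  static boolean isEmoji(int codePoint) {
    return looksLikeEmoji(codePoint) && EXACT.contains(codePoint);
  }
}
```

The trade-off is that the automaton over-matches slightly and the callback 
pays for one extra lookup per candidate.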

IMHO for the domain name standard tokenizer it should maybe be done similarly: 
just match anything that looks like a domain and do a separate check on 
possible matches.
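
The same pattern for domain names could look roughly like the following Java 
sketch. The class DomainCandidateCheck, the loose pattern and the three-entry 
TLD set are assumptions made for illustration; a real tokenizer would use the 
full generated TLD list.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch: loose pattern match first, exact TLD check afterwards. */
final class DomainCandidateCheck {
  // Loose match: two or more dot-separated labels of letters, digits and hyphens.
  private static final Pattern LOOKS_LIKE_DOMAIN =
      Pattern.compile("[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

  // Illustrative subset only; not the real list of top-level domains.
  private static final Set<String> KNOWN_TLDS = Set.of("org", "com", "net");

  static boolean isDomain(String candidate) {
    if (!LOOKS_LIKE_DOMAIN.matcher(candidate).matches()) {
      return false; // does not even look like a domain
    }
    String tld = candidate.substring(candidate.lastIndexOf('.') + 1);
    return KNOWN_TLDS.contains(tld.toLowerCase(Locale.ROOT));
  }
}
```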

> Modernize Emoji regeneration scripts
> ------------------------------------
>
>                 Key: LUCENE-9914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>
> These are perl scripts... I don't think they had ant tasks in 8x and they 
> haven't been used in a while. They don't seem too scary (for perl) - just 
> fetch emoji unicode descriptions and parse them into a jflex macro and a test 
> case.
> It'd be good to convert them to use python, groovy or even java so that they 
> fit better in the build system. Alternatively - perhaps there is a way to get 
> these codepoint properties from Java directly?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
