[
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317155#comment-17317155
]
Uwe Schindler commented on LUCENE-9914:
---------------------------------------
That's really fast. The reason I did it like that was that ICU should not be a
runtime dependency, so the script just extracts the data and provides it to
CharTokenizer through the Bits interface (backed by a sparse bitset). The script
only takes milliseconds. 😜
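
To illustrate the idea of shipping precomputed data instead of depending on ICU
at runtime, here is a minimal sketch. It is not the actual UnicodeData code: the
class name, the CodePointBits interface (a stand-in for a Bits-like membership
test) and the two ranges are assumptions made for the example, and
java.util.BitSet stands in for a sparse bitset.

```java
import java.util.BitSet;

public class GeneratedCodePointSet {
  /** Hypothetical stand-in for a Bits-like interface: membership test by index. */
  interface CodePointBits {
    boolean get(int codePoint);

    int length();
  }

  // Illustrative inclusive start/end pairs only. A real generated file would hold
  // data extracted from ICU at build time, so ICU is not needed at runtime.
  private static final int[] RANGES = {
    0x1F600, 0x1F64F, // Emoticons block
    0x1F680, 0x1F6FF, // Transport and Map Symbols block
  };

  static CodePointBits load() {
    BitSet bits = new BitSet();
    for (int i = 0; i < RANGES.length; i += 2) {
      // BitSet.set(from, to) treats 'to' as exclusive, hence the + 1.
      bits.set(RANGES[i], RANGES[i + 1] + 1);
    }
    return new CodePointBits() {
      @Override
      public boolean get(int codePoint) {
        return codePoint >= 0 && bits.get(codePoint);
      }

      @Override
      public int length() {
        return Character.MAX_CODE_POINT + 1;
      }
    };
  }

  public static void main(String[] args) {
    CodePointBits set = load();
    System.out.println(set.get(0x1F600)); // inside the first range
    System.out.println(set.get('a'));     // outside both ranges
  }
}
```

A tokenizer would only call get(codePoint); it never needs to know how the data
was produced.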
Maybe we can just extend the class UnicodeData to contain Emoji codepoints in a
similar way and let the jflex code depend on it.
Because of my bad experience with the domain name tokenizer, I tend to think
that the FSA should only contain a "best guess", such as Unicode ranges, so that
the FSA stays small. The exact emoji lookup could then be done in the JFlex
callback, with everything that is not an emoji handed back to JFlex as no match.
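
A rough sketch of that two-phase idea, written as plain Java rather than as a
JFlex grammar. Everything here is hypothetical: the class and method names, the
coarse ranges and the tiny exact set are placeholders, not real emoji data and
not the JFlex API.

```java
import java.util.Set;

public class TwoPhaseEmojiMatch {
  // Placeholder for the exact, generated emoji data (illustrative values only).
  private static final Set<Integer> EXACT = Set.of(0x1F600, 0x1F680, 0x2764);

  // Phase 1: the cheap "best guess" the automaton would encode, a few wide ranges.
  static boolean coarseCandidate(int codePoint) {
    return (codePoint >= 0x2600 && codePoint <= 0x27BF)
        || (codePoint >= 0x1F300 && codePoint <= 0x1FAFF);
  }

  // Phase 2: the exact lookup that would run in the match callback.
  static boolean exactMatch(int codePoint) {
    return EXACT.contains(codePoint);
  }

  // A candidate counts only if both phases agree; otherwise it is "no match".
  static boolean isEmoji(int codePoint) {
    return coarseCandidate(codePoint) && exactMatch(codePoint);
  }

  public static void main(String[] args) {
    System.out.println(isEmoji(0x1F600)); // coarse hit, exact hit
    System.out.println(isEmoji(0x1F301)); // coarse hit, rejected by the exact set
    System.out.println(isEmoji('a'));     // rejected by the coarse ranges
  }
}
```

The point of the split is that the automaton only has to know a handful of
ranges, while the large exact table lives outside it.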
IMHO, for the domain name standard tokenizer it should maybe be done similarly:
just match anything that looks like a domain and do a separate check on possible
matches.
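
The same pattern for domains could look roughly like this. It is only a sketch
under assumptions: the loose regex, the three-entry TLD set and all names are
made up for the example and do not reflect the existing tokenizer grammar.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class TwoPhaseDomainMatch {
  // Loose shape: dot-separated labels of letters, digits and hyphens.
  private static final Pattern DOMAIN_SHAPE =
      Pattern.compile("[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

  // Placeholder for a full top-level domain list kept outside the automaton.
  private static final Set<String> KNOWN_TLDS = Set.of("org", "com", "net");

  static boolean isDomain(String candidate) {
    // Phase 1: cheap structural match.
    if (!DOMAIN_SHAPE.matcher(candidate).matches()) {
      return false;
    }
    // Phase 2: separate check of the last label against the known list.
    String tld = candidate.substring(candidate.lastIndexOf('.') + 1);
    return KNOWN_TLDS.contains(tld.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(isDomain("lucene.apache.org")); // shape ok, TLD known
    System.out.println(isDomain("notes.txt"));         // shape ok, TLD unknown
    System.out.println(isDomain("nodots"));            // fails the shape check
  }
}
```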
> Modernize Emoji regeneration scripts
> ------------------------------------
>
> Key: LUCENE-9914
> URL: https://issues.apache.org/jira/browse/LUCENE-9914
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
>
> These are perl scripts... I don't think they had ant tasks in 8x and they
> haven't been used in a while. They don't seem too scary (for perl) - just
> fetch emoji unicode descriptions and parse them into a jflex macro and a test
> case.
> It'd be good to convert them to use python, groovy or even java so that they
> fit better in the build system. Alternatively - perhaps there is a way to get
> these codepoint properties from Java directly?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)