[ https://issues.apache.org/jira/browse/LUCENE-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033934#comment-14033934 ]
Robert Muir commented on LUCENE-5770:
-------------------------------------
I'm not worried about it. We should also consider the data: Wikipedia is a
little unusual in that it has an abnormally high proportion of these
characters compared with most content.
I tried optimizing the fast path in CharTokenizer/LowerCaseFilter etc. just as
an experiment (because when you look at codePointAt/count you see a lot of
checks, etc.), but when I ran it on data from blogs/tweets etc. it made no
difference at all. I was also struggling with noise in Mike's benchmark.
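For anyone curious what that kind of fast path might look like, here is a
rough sketch. It is not the code from the experiment and not Lucene's actual
implementation: the class and method names are hypothetical, and it assumes
that lowercasing never changes the number of chars a code point occupies. Only
standard java.lang.Character calls are used.

```java
// Hypothetical illustration only; not taken from Lucene or from the patch.
final class FastPathSketch {

    // Lowercases buf[offset, limit) in place and returns the number of
    // supplementary code points seen (useful for judging how often the
    // slow path is actually taken on a given data set).
    static int lowerCaseInPlace(char[] buf, int offset, int limit) {
        int supplementary = 0;
        for (int i = offset; i < limit; ) {
            char c = buf[i];
            if (!Character.isSurrogate(c)) {
                // Fast path: a plain BMP char, no surrogate decoding needed.
                buf[i] = Character.toLowerCase(c);
                i++;
            } else {
                // Slow path: decode a possible surrogate pair.
                int cp = Character.codePointAt(buf, i, limit);
                if (Character.isSupplementaryCodePoint(cp)) {
                    supplementary++;
                }
                // Assumption: the lowercase form needs the same number of
                // chars as the original, so it can be written back in place.
                i += Character.toChars(Character.toLowerCase(cp), buf, i);
            }
        }
        return supplementary;
    }
}
```

On text with few or no supplementary characters the slow branch is almost
never taken, so the returned count gives a quick read on whether a data set
would exercise it at all.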
> Upgrade JFlex to 1.6.0
> ----------------------
>
> Key: LUCENE-5770
> URL: https://issues.apache.org/jira/browse/LUCENE-5770
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Steve Rowe
> Assignee: Steve Rowe
> Priority: Minor
> Fix For: 5.0, 4.10
>
> Attachments: LUCENE-5770.patch
>
>
> JFlex 1.6, to be released shortly, will have direct support for supplementary
> code points - JFlex 1.5 and earlier only support code points in the BMP.
> We can drop the use of ICU4J to generate surrogate pairs to extend our JFlex
> scanner specifications to handle supplementary code points.