[ 
https://issues.apache.org/jira/browse/LUCENE-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033934#comment-14033934
 ] 

Robert Muir commented on LUCENE-5770:
-------------------------------------

I'm not worried about it. We should also consider the data: Wikipedia is a 
little strange in that it has an unusually high proportion of these 
characters compared to most content. 

I tried optimizing the fast path in CharTokenizer/LowerCaseFilter etc. just as 
an experiment (because when you look at codePointAt/count you see a lot of 
checks etc.), but when I ran it on data from blogs/tweets etc. it made no 
difference at all. I was also struggling with noise in Mike's benchmark.
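
For illustration, the kind of fast path meant here might look roughly like the 
sketch below. This is not the actual CharTokenizer/LowerCaseFilter code or the 
experiment itself: the class and method names are made up, and it assumes that 
lower-casing never changes the number of chars a code point occupies, which an 
in-place rewrite of the buffer requires.

```java
public final class LowerCaseFastPathSketch {

    private LowerCaseFastPathSketch() {}

    /** Baseline: every position goes through the code point API. */
    public static void lowerCaseByCodePoint(char[] buf, int offset, int length) {
        final int end = offset + length;
        for (int i = offset; i < end; ) {
            // codePointAt checks bounds and looks for a surrogate pair each time.
            int cp = Character.codePointAt(buf, i, end);
            // toChars writes the result in place and returns 1 or 2.
            i += Character.toChars(Character.toLowerCase(cp), buf, i);
        }
    }

    /** Hypothetical fast path: handle non-surrogate chars without the code point API. */
    public static void lowerCaseFastPath(char[] buf, int offset, int length) {
        final int end = offset + length;
        for (int i = offset; i < end; ) {
            char c = buf[i];
            if (!Character.isSurrogate(c)) {
                // Common case: a BMP character that is a complete code point.
                buf[i++] = Character.toLowerCase(c);
            } else {
                // Rare case: fall back to the general code point handling.
                int cp = Character.codePointAt(buf, i, end);
                i += Character.toChars(Character.toLowerCase(cp), buf, i);
            }
        }
    }

    /** Convenience wrapper for trying the fast path on a String. */
    public static String fast(String s) {
        char[] buf = s.toCharArray();
        lowerCaseFastPath(buf, 0, buf.length);
        return new String(buf);
    }

    /** Convenience wrapper for trying the baseline on a String. */
    public static String baseline(String s) {
        char[] buf = s.toCharArray();
        lowerCaseByCodePoint(buf, 0, buf.length);
        return new String(buf);
    }
}
```

Both methods should give the same output. The fast path only avoids the extra 
checks for text that has no surrogates, which is nearly all of it for typical 
content, so any gain would have to come from those checks alone.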

> Upgrade JFlex to 1.6.0
> ----------------------
>
>                 Key: LUCENE-5770
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5770
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5770.patch
>
>
> JFlex 1.6, to be released shortly, will have direct support for supplementary 
> code points - JFlex 1.5 and earlier only support code points in the BMP.
> We can drop the use of ICU4J to generate surrogate pairs to extend our JFlex 
> scanner specifications to handle supplementary code points.  



