[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380241#comment-17380241 ]

ASF subversion and git services commented on LUCENE-9177:
---------------------------------------------------------

Commit c3482c99ffd9b30acb423e63760ebc7baab9dd26 in lucene's branch 
refs/heads/main from Michael Gibney
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c3482c9 ]

LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence 
of normalization-inert characters (#199)

Normalization-inert characters need not be required as boundaries
for incremental processing. It is sufficient to check `hasBoundaryAfter`
and `hasBoundaryBefore`, substantially improving worst-case performance.
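
For illustration only (not part of the commit), a minimal sketch against the
ICU4J Normalizer2 API that the char filter wraps: a character does not have to
be normalization-inert to serve as a safe incremental split point, and the
weaker hasBoundaryBefore/hasBoundaryAfter checks are enough. The class name
below is made up for the example.

    import com.ibm.icu.text.Normalizer2;

    public class BoundaryCheckSketch {
      public static void main(String[] args) {
        Normalizer2 norm = Normalizer2.getNFKCInstance();
        int cp = 0x2103; // '℃', the character used in the report below
        // isInert() is the strict property; hasBoundaryBefore/hasBoundaryAfter
        // are weaker conditions that can hold even for non-inert characters,
        // so they offer many more opportunities to flush incrementally.
        System.out.println("isInert           = " + norm.isInert(cp));
        System.out.println("hasBoundaryBefore = " + norm.hasBoundaryBefore(cp));
        System.out.println("hasBoundaryAfter  = " + norm.hasBoundaryAfter(cp));
      }
    }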

> ICUNormalizer2CharFilter worst case is very slow
> ------------------------------------------------
>
>                 Key: LUCENE-9177
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9177
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>             Fix For: main (9.0), 8.10
>
>         Attachments: LUCENE-9177-benchmark-test.patch, 
> LUCENE-9177_LUCENE-8972.patch, lucene.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> ICUNormalizer2CharFilter is fast most of the time, but we've had reports in 
> Elasticsearch that some unrealistic data can slow down processing very 
> significantly. For instance, an input that consists of characters to normalize 
> with no normalization-inert character in between can take several seconds to 
> process a few hundred kilobytes on my machine. While such input is not 
> realistic, this worst case can slow down indexing considerably when dealing 
> with uncleaned data.
> I attached a small test that reproduces the slow processing using a stream 
> that repeats the character `℃` many times with no normalization-inert 
> character. I am not surprised that processing is slower than usual, but 
> several seconds seems excessive. Adding normalization-inert characters makes 
> processing much faster, so I wonder if we can improve the process to split 
> the input more eagerly?
>  
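
For reference, a self-contained sketch along the lines of the attached
benchmark test (this is not the attached patch itself; the class name and
sizes are arbitrary): it drains an ICUNormalizer2CharFilter over a few hundred
kilobytes of repeated `℃` and prints the wall-clock time, which is where the
worst case shows up.

    import java.io.Reader;
    import java.io.StringReader;

    import com.ibm.icu.text.Normalizer2;
    import org.apache.lucene.analysis.CharFilter;
    import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;

    public class WorstCaseSketch {
      public static void main(String[] args) throws Exception {
        // A few hundred kilobytes of U+2103 with no normalization-inert
        // character anywhere in the stream.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
          sb.append('\u2103'); // ℃
        }
        Reader input = new StringReader(sb.toString());
        CharFilter filter =
            new ICUNormalizer2CharFilter(input, Normalizer2.getNFKCInstance());

        long start = System.nanoTime();
        char[] buf = new char[1024];
        while (filter.read(buf, 0, buf.length) != -1) {
          // just drain the filter; the time spent here is what the issue describes
        }
        filter.close();
        System.out.println("drained in " + (System.nanoTime() - start) / 1_000_000 + " ms");
      }
    }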


