[
https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir resolved LUCENE-9177.
---------------------------------
Resolution: Fixed
> ICUNormalizer2CharFilter worst case is very slow
> ------------------------------------------------
>
> Key: LUCENE-9177
> URL: https://issues.apache.org/jira/browse/LUCENE-9177
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Fix For: main (9.0), 8.10
>
> Attachments: LUCENE-9177-benchmark-test.patch,
> LUCENE-9177_LUCENE-8972.patch, lucene.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> ICUNormalizer2CharFilter is fast most of the time, but we've had reports
> in Elasticsearch that some unrealistic data can slow down processing very
> significantly. For instance, an input that consists of characters to normalize
> with no normalization-inert character in between can take several seconds
> to process a few hundred kilobytes on my machine. While the input is not
> realistic, this worst case can slow down indexing considerably when dealing
> with uncleaned data.
> I attached a small test that reproduces the slow processing using a stream
> that contains many repetitions of the character `℃` and no
> normalization-inert character. I am not surprised that the processing is
> slower than usual, but several seconds seems like a lot. Adding a
> normalization-inert character makes the processing much faster, so I
> wonder if we can improve the process to split the input more eagerly?
>
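For context on why this input is pathological: `℃` (U+2103) is not normalization-inert, since NFKC decomposes it to `°C`, so a stream of nothing but `℃` gives the char filter no safe boundary at which to split its work. A minimal sketch of that kind of input, using the JDK's java.text.Normalizer for illustration instead of Lucene's ICU-based filter (class name `WorstCaseInput` and the 100,000-character length are assumptions, not from the issue):

```java
import java.text.Normalizer;

public class WorstCaseInput {
    public static void main(String[] args) {
        // U+2103 (℃) is not normalization-inert: NFKC maps it to U+00B0 U+0043.
        System.out.println(Normalizer.normalize("\u2103", Normalizer.Form.NFKC)); // prints "°C"

        // Build the pathological input from the issue: a long run of ℃ with
        // no normalization-inert character to serve as a split point.
        String input = "\u2103".repeat(100_000);

        // The JDK normalizes the whole string in one pass; the CharFilter
        // instead works incrementally on the stream, which is where the
        // reported worst-case slowdown appears.
        long start = System.nanoTime();
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("normalized length: " + normalized.length());
        System.out.println("took " + elapsedMs + " ms");
    }
}
```

Replacing some of the `℃` characters with normalization-inert ones (e.g. ASCII letters) is what the reporter observed to restore fast processing in the char filter.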
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]