[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380241#comment-17380241 ]
ASF subversion and git services commented on LUCENE-9177:
---------------------------------------------------------

Commit c3482c99ffd9b30acb423e63760ebc7baab9dd26 in lucene's branch refs/heads/main from Michael Gibney
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c3482c9 ]

LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence of normalization-inert characters (#199)

Normalization-inert characters need not be required as boundaries for incremental processing. It is sufficient to check `hasBoundaryAfter` and `hasBoundaryBefore`, substantially improving worst-case performance.

> ICUNormalizer2CharFilter worst case is very slow
> ------------------------------------------------
>
>                 Key: LUCENE-9177
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9177
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>             Fix For: main (9.0), 8.10
>
>         Attachments: LUCENE-9177-benchmark-test.patch,
>                      LUCENE-9177_LUCENE-8972.patch, lucene.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> ICUNormalizer2CharFilter is fast most of the time, but we've had some reports
> in Elasticsearch that some unrealistic data can slow down the process very
> significantly. For instance, an input that consists of characters to normalize
> with no normalization-inert character in between can take up to several
> seconds to process a few hundred kilobytes on my machine. While the input
> is not realistic, this worst case can slow down indexing considerably when
> dealing with uncleaned data.
> I attached a small test that reproduces the slow processing using a stream
> that contains many repetitions of the character `℃` and no
> normalization-inert character. I am not surprised that the processing is
> slower than usual, but several seconds to process seems like a lot. Adding
> normalization-inert characters makes the processing a lot faster, so I
> wonder if we can improve the process to split the input more eagerly?
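
To illustrate the idea in the commit message, here is a minimal sketch, not the Lucene implementation. It assumes Python's `unicodedata` NFKC as a stand-in for the ICU `Normalizer2` used by the char filter. `has_boundary_before` and `normalize_incrementally` are hypothetical helpers written for this example; the first only approximates what ICU's `hasBoundaryBefore` reports. The point it tries to show: a run of `℃` contains no normalization-inert character, yet every `℃` starts a new normalization boundary, so the input can be flushed one character at a time instead of being buffered as a whole.

```python
import sys
import unicodedata

FORM = "NFKC"  # assumption: stand-in for the filter's ICU normalizer


def _combines_backward():
    """Collect code points that can be the second half of a canonical pair."""
    seconds = set()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility mappings carry a "<tag>" prefix; skip those.
        if decomp and not decomp.startswith("<"):
            parts = decomp.split()
            if len(parts) == 2:
                seconds.add(int(parts[1], 16))
    # Hangul vowel and trailing-consonant jamo compose algorithmically,
    # so they do not show up in the decomposition data.
    seconds.update(range(0x1161, 0x1176))
    seconds.update(range(0x11A8, 0x11C3))
    return seconds


COMBINES_BACKWARD = _combines_backward()


def has_boundary_before(ch):
    """Approximation: text before `ch` cannot affect how `ch` normalizes."""
    first = unicodedata.normalize("NFKD", ch)[0]
    return (unicodedata.combining(first) == 0
            and ord(first) not in COMBINES_BACKWARD)


def normalize_incrementally(text):
    """Normalize chunk by chunk, cutting wherever a boundary is found."""
    out = []
    start = 0
    for i in range(1, len(text)):
        if has_boundary_before(text[i]):
            out.append(unicodedata.normalize(FORM, text[start:i]))
            start = i
    out.append(unicodedata.normalize(FORM, text[start:]))
    return "".join(out)


# The worst-case shape from the issue: only `℃`, no inert characters.
worst_case = "\u2103" * 1000
assert normalize_incrementally(worst_case) == unicodedata.normalize(FORM, worst_case)
```

A combining mark such as U+0301 does not have a boundary before it, so in this sketch it stays in the same chunk as the preceding character and the chunked result still matches whole-string normalization.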