[ https://issues.apache.org/jira/browse/LUCENE-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176574#comment-17176574 ]
Dawid Weiss commented on LUCENE-9457:
-------------------------------------

It's one of those things that are exciting to debug, take days to complete, and sometimes never reach any reasonable explanation. :)

> Why is Kuromoji tokenization throughput bimodal?
> ------------------------------------------------
>
>                 Key: LUCENE-9457
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9457
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>
> With the recent accidental regression of Japanese (Kuromoji) tokenization
> throughput due to exciting FST optimizations, we [added new nightly Lucene
> benchmarks|https://github.com/mikemccand/luceneutil/issues/64] to measure
> tokenization throughput for {{JapaneseTokenizer}}:
> [https://home.apache.org/~mikemccand/lucenebench/analyzers.html]
>
> It has already been running for ~5-6 weeks now! But for some reason, it
> looks bimodal? "Normally" it is ~0.45 M tokens/sec, but for two data points
> it dropped down to ~0.33 M tokens/sec, which is odd. Could it be HotSpot
> noise, maybe? It would be good to get to the root cause and fix it if
> possible. HotSpot noise that randomly steals ~27% of your tokenization
> throughput is no good!!
>
> Or does anyone have any other ideas about what could be bimodal in Kuromoji?
> I don't think [this performance
> test|https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/TestAnalyzerPerf.java]
> has any randomness in it...
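For context, the benchmark boils down to timing a tight tokenize loop over fixed input. Below is a minimal, self-contained sketch of that kind of measurement, not the actual TestAnalyzerPerf code: the class name, sample text, and iteration counts are placeholders, and it assumes Lucene's analyzers-kuromoji module (Lucene 8.x API) is on the classpath. The real test tokenizes a much larger corpus.

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class KuromojiThroughputSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder input; the real benchmark reads a much larger fixed corpus.
    String text = "日本語の形態素解析はコストが高い処理です。";

    // One reusable tokenizer: null = no user dictionary,
    // true = discard punctuation tokens.
    JapaneseTokenizer tok =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);

    // Several timed rounds, so round-to-round JIT warmup effects are visible.
    for (int round = 0; round < 10; round++) {
      long tokens = 0;
      long startNS = System.nanoTime();
      for (int iter = 0; iter < 100_000; iter++) {
        tok.setReader(new StringReader(text));
        tok.reset();
        while (tok.incrementToken()) {
          tokens++;
        }
        tok.end();
        tok.close(); // close() lets setReader() be called again for reuse
      }
      double sec = (System.nanoTime() - startNS) / 1e9;
      System.out.printf("round %d: %.3f M tokens/sec%n",
          round, tokens / sec / 1e6);
    }
  }
}
{code}

Since the loop itself is deterministic, a bimodal result from a loop like this would point at the JVM rather than the tokenizer. One way to test the HotSpot theory would be to run several forked JVMs with -XX:+PrintCompilation (or with -XX:-TieredCompilation) and compare whether the slow runs show different compilation decisions for the hot tokenization methods.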