[ 
https://issues.apache.org/jira/browse/LUCENE-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176574#comment-17176574
 ] 

Dawid Weiss commented on LUCENE-9457:
-------------------------------------

It's one of those things that are exciting to debug, take days to complete, 
and sometimes never lead to any reasonable explanation. :)

> Why is Kuromoji tokenization throughput bimodal?
> ------------------------------------------------
>
>                 Key: LUCENE-9457
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9457
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>
> With the recent accidental regression of Japanese (Kuromoji) tokenization 
> throughput due to exciting FST optimizations, we [added new nightly Lucene 
> benchmarks|https://github.com/mikemccand/luceneutil/issues/64] to measure 
> tokenization throughput for {{JapaneseTokenizer}}: 
> [https://home.apache.org/~mikemccand/lucenebench/analyzers.html]
> It has already been running for ~5-6 weeks now!  But for some reason, it 
> looks bimodal?  "Normally" it is ~0.45 M tokens/sec, but for two data points 
> it dropped to ~0.33 M tokens/sec, which is odd.  Could it be HotSpot noise?  
> Either way, it would be good to get to the root cause and fix it if 
> possible.
> HotSpot noise that randomly steals ~27% of your tokenization throughput is no 
> good!!
> Or does anyone have any other ideas of what could be bi-modal in Kuromoji?  I 
> don't think [this performance 
> test|https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/TestAnalyzerPerf.java]
>  has any randomness in it...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

