Small correction: the nightly benchmarks currently use "only" 36 threads.

And indeed the "#profiler_4kb_indexing_1_cpu" is single-threaded indexing!
Why on earth is addAttribute so costly in that case?  Wow, I'm glad I named
those anchor links well lol.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <[email protected]> wrote:

> Looking at the source/stating the obvious, creating a new
> StringTokenStream from here only happens under certain conditions:
> * field is indexed
> * field is not tokenized (e.g. not using analyzer, ID field or similar)
> * incoming "reuse" parameter is not a StringTokenStream
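
The three conditions listed above can be sketched as a small stand-alone model. This is only an illustration under stated assumptions: ReuseSketch, its nested stream types, and the allocations counter are hypothetical stand-ins and are not Lucene's actual Field or TokenStream code.

```java
// Hypothetical, simplified model of the reuse conditions; NOT Lucene's real classes.
public class ReuseSketch {
    interface TokenStream {}

    // Stand-in for the single-token stream used by non-tokenized fields.
    static final class StringTokenStream implements TokenStream {
        String value;
    }

    // Stand-in for whatever an analyzer would produce (not modeled further).
    static final class AnalyzedTokenStream implements TokenStream {}

    // Counts how many StringTokenStream stand-ins were created.
    static int allocations = 0;

    static TokenStream tokenStream(boolean indexed, boolean tokenized,
                                   TokenStream reuse, String value) {
        if (!indexed) {
            return null; // condition 1 fails: nothing to invert
        }
        if (tokenized) {
            return new AnalyzedTokenStream(); // condition 2 fails: analyzer path
        }
        if (!(reuse instanceof StringTokenStream)) {
            // condition 3: the incoming "reuse" is not a StringTokenStream
            reuse = new StringTokenStream();
            allocations++;
        }
        ((StringTokenStream) reuse).value = value;
        return reuse;
    }

    public static void main(String[] args) {
        TokenStream reuse = null;
        for (int doc = 0; doc < 1000; doc++) {
            // The caller hands back the previous stream, so only the first call allocates.
            reuse = tokenStream(true, false, reuse, "id-" + doc);
        }
        System.out.println("allocations = " + allocations);
    }
}
```

In this model, as long as the caller keeps passing the previous stream back in, a new instance is created only on the first call; an allocation per document would suggest the "reuse" argument is being lost somewhere.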
>
> What is puzzling to me is that it only seems to hit the 4KB documents.
> If there is an issue here, I'd expect it to have an even higher impact
> for the 1KB documents indexing.
>
> But also the internal reuse of IndexingChain.PerField (which houses
> the reused tokenstream) isn't just per-thread, it is
> per-thread-per-segment, right? So if Mike is indexing with 100
> threads, and flushes 200 times, I'd expect 20k of these things to be
> made. There's a lot going on in the benchmark code for nightly and it
> is tricky for me to try to navigate the various cases (1KB,
> 1KB-with-vectors, 4KB, "deterministic indexing", etc)
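
The back-of-the-envelope estimate above (per-thread-per-segment reuse) can be written out as a tiny calculation. This is a rough sketch only: PerFieldCountSketch and estimateInstances are hypothetical names, and the model assumes one new stream per thread per flush for a single non-tokenized field, which is the simplification used in the estimate rather than a measured figure.

```java
// Hypothetical helper for the rough estimate; not part of Lucene or the benchmark code.
public class PerFieldCountSketch {
    // Assumes one new stream per indexing thread per flushed segment, for one field.
    static long estimateInstances(int threads, int flushes) {
        return (long) threads * flushes;
    }

    public static void main(String[] args) {
        // The numbers used in the estimate above: 100 threads, 200 flushes.
        System.out.println(estimateInstances(100, 200)); // prints 20000
        // The same flush count with the 36 threads the nightly benchmarks use.
        System.out.println(estimateInstances(36, 200));  // prints 7200
    }
}
```

Either way the count stays in the thousands, which is small next to the number of documents indexed.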
>
> On Thu, Oct 21, 2021 at 3:40 AM Adrien Grand <[email protected]> wrote:
> >
> > Hello,
> >
> > I've been looking a bit more carefully at nightly benchmarks recently
> > and I'm puzzled by the fact that indexing spends almost 5% of the time on
> > AttributeSource#addAttribute. Here is the link.
> >
> > 4.37%         14731     org.apache.lucene.util.AttributeSource#addAttribute()
> >                               at org.apache.lucene.document.Field$StringTokenStream#()
> >                               at org.apache.lucene.document.Field#tokenStream()
> >                               at org.apache.lucene.index.IndexingChain$PerField#invert()
> >                               at org.apache.lucene.index.IndexingChain#processField()
> >                               at org.apache.lucene.index.IndexingChain#processDocument()
> >                               at org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
> >                               at org.apache.lucene.index.DocumentsWriter#updateDocuments()
> >                               at org.apache.lucene.index.IndexWriter#updateDocuments()
> >                               at org.apache.lucene.index.IndexWriter#updateDocument()
> >                               at org.apache.lucene.index.IndexWriter#addDocument()
> >                               at perf.IndexThreads$IndexThread#run()
> >
> > Given that nightly benchmarks reuse Field instances across documents,
> > this should only happen once per thread, so why does it show up as a
> > bottleneck in our nightly benchmarks? I tried to reproduce locally, but I'm
> > not seeing AttributeSource among top CPU consumers.
> >
> > --
> > Adrien
>
