Small correction: the nightly benchmarks currently use "only" 36 threads.

And indeed the "#profiler_4kb_indexing_1_cpu" is single-threaded indexing! Why on earth is addAttribute so costly in that case?

Wow, I'm glad I named those anchor links well lol.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <[email protected]> wrote:

> Looking at the source/stating the obvious, creating a new
> StringTokenStream from here only happens under certain conditions:
> * field is indexed
> * field is not tokenized (e.g. not using analyzer, ID field or similar)
> * incoming "reuse" parameter is not a StringTokenStream
>
> What is puzzling to me is that it only seems to hit the 4KB documents.
> If there is an issue here, I'd expect it to have an even higher impact
> for the 1KB documents indexing.
>
> But also the internal reuse of IndexingChain.PerField (which houses
> the reused tokenstream) isn't just per-thread, it is
> per-thread-per-segment, right? So if Mike is indexing with 100
> threads, and flushes 200 times, I'd expect 20k of these things to be
> made. There's a lot going on in the benchmark code for nightly and it
> is tricky for me to try to navigate the various cases (1KB,
> 1KB-with-vectors, 4KB, "deterministic indexing", etc)
>
> On Thu, Oct 21, 2021 at 3:40 AM Adrien Grand <[email protected]> wrote:
> >
> > Hello,
> >
> > I've been looking a bit more carefully at nightly benchmarks recently
> > and I'm puzzled by the fact that indexing spends almost 5% of the time on
> > AttributeSource#addAttribute. Here is the link.
> >
> > 4.37%  14731  org.apache.lucene.util.AttributeSource#addAttribute()
> >   at org.apache.lucene.document.Field$StringTokenStream#()
> >   at org.apache.lucene.document.Field#tokenStream()
> >   at org.apache.lucene.index.IndexingChain$PerField#invert()
> >   at org.apache.lucene.index.IndexingChain#processField()
> >   at org.apache.lucene.index.IndexingChain#processDocument()
> >   at org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
> >   at org.apache.lucene.index.DocumentsWriter#updateDocuments()
> >   at org.apache.lucene.index.IndexWriter#updateDocuments()
> >   at org.apache.lucene.index.IndexWriter#updateDocument()
> >   at org.apache.lucene.index.IndexWriter#addDocument()
> >   at perf.IndexThreads$IndexThread#run()
> >
> > Given that nightly benchmarks reuse Field instances across documents,
> > this should only happen once per thread, so why does it show up as a
> > bottleneck in our nightly benchmarks? I tried to reproduce locally, but I'm
> > not seeing AttributeSource among top CPU consumers.
> >
> > --
> > Adrien
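
To make the per-thread-per-segment arithmetic above concrete, here is a rough sketch in plain Java. It is not Lucene code: `ReuseSketch`, `StringStream`, `tokenStream` and `simulate` are hypothetical stand-ins that only model the three conditions Robert lists, under the assumption that each (thread, flush) pair starts with an empty reuse slot.

```java
/** Hypothetical, simplified model of the reuse logic discussed in this thread; not Lucene code. */
final class ReuseSketch {

    /** Stand-in for the stream built for indexed, non-tokenized fields. */
    static final class StringStream {
        static long created = 0; // counts constructor calls (single-threaded sketch)

        StringStream() {
            created++;
        }
    }

    /** Allocates a new stream only when all three conditions from the thread hold. */
    static Object tokenStream(boolean indexed, boolean tokenized, Object reuse) {
        if (!indexed) {
            return null; // nothing to invert
        }
        if (tokenized) {
            return reuse; // analyzer path, not modeled here
        }
        if (reuse instanceof StringStream) {
            return reuse; // incoming "reuse" is already the right type
        }
        return new StringStream();
    }

    /** Assumption: the reuse slot is reset for every (thread, flush) pair. */
    static long simulate(int threads, int flushes, int docsPerSegment) {
        StringStream.created = 0;
        for (int t = 0; t < threads; t++) {
            for (int f = 0; f < flushes; f++) {
                Object reuse = null; // fresh per-field state for a new segment
                for (int d = 0; d < docsPerSegment; d++) {
                    reuse = tokenStream(true, false, reuse);
                }
            }
        }
        return StringStream.created;
    }

    public static void main(String[] args) {
        // Robert's example: 100 threads, 200 flushes.
        System.out.println(simulate(100, 200, 5));
        // Single-threaded run with no intermediate flush.
        System.out.println(simulate(1, 1, 1000));
    }
}
```

Under this model the number of constructions depends only on threads times flushes, not on the number of documents, so a single-threaded run would need a very large number of flushes for the constructor to show up in a profile.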
