Hi Anton,

It took me a while to get through the blog post, and I suspect I will need to
read through it a couple more times to understand it fully. Thanks for writing
up something so detailed. I learnt a lot! (especially about JVM method
inlining).
> I have not been able to reproduce the speedup with luceneutil - I suspect
> that the default tasks in it would not trigger this code path that much.

Yeah, it seems like luceneutil is not stressing the code path that
Elasticsearch's benchmarks are?

> I tried changing the DocIdsWriter::readInts32 (and readDelta16), instead
> calling the IntersectVisitor with a DocIdSetIterator, to reduce the number
> of virtual calls. In the benchmark setup by Elastic [2] I saw a decrease of
> execution time of 35-45% for range queries and numerical sorting with this
> patch applied.

So it seems like switching over from an iterative visit(int docID) call to a
bulk visit(DocIdSetIterator iterator) gave us these gains? Cool! I am running
Amazon Product Search's benchmarks to see if the change is needle-moving for
us.

Small suggestion on the blog: the JVM inlining and the Elasticsearch
short-circuiting/optimization causing a difference in performance could have
been a blog post of its own, part 1 maybe. I got confused when the blog
shifted from the performance differences between Elasticsearch and OpenSearch
to how you ended up improving Lucene.

Regards,
Gautam Worah.

On Fri, Mar 1, 2024 at 2:42 AM Anton Hägerstrand <an...@blunders.io> wrote:

> Hi everyone, long time lurker here.
>
> I recently investigated Elasticsearch/OpenSearch performance in a blog
> post [1], and saw some interesting behavior of numerical range queries and
> numerical sorting with regards to inlining and virtual calls.
>
> In short, the DocIdsWriter::readInts method seems to get much slower if it
> is called with 3 or more implementations of IntersectVisitor during the JVM
> lifetime. I believe that this is due to IntersectVisitor.visit(docid)
> being heavily inlined with 2 or fewer IntersectVisitor implementations,
> while becoming a virtual call with 3 or more.
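Interjecting here with a toy sketch for anyone following along - this is my
own illustration, not Lucene code, of the behavior described above: HotSpot
will inline a virtual call site that has seen one receiver type (monomorphic)
or two (bimorphic), but once a third implementation shows up the site
typically becomes megamorphic and pays a real virtual dispatch:

```java
// Toy sketch (not Lucene code) of a call site going megamorphic once a
// third implementation of an interface is observed at the same call site.
interface Visitor {
    void visit(int docId);
}

public class MegamorphicDemo {
    static long sum;

    // The hot call site: after drive() has been invoked with 3 different
    // concrete Visitor types, v.visit(i) can no longer be inlined here.
    static void drive(Visitor v, int n) {
        for (int i = 0; i < n; i++) {
            v.visit(i);
        }
    }

    static long run() {
        sum = 0;
        Visitor a = docId -> sum += docId;       // implementation 1
        Visitor b = docId -> sum += 2L * docId;  // implementation 2
        Visitor c = docId -> sum += 3L * docId;  // implementation 3
        drive(a, 1000);
        drive(b, 1000);
        drive(c, 1000); // the call site has now seen three receiver types
        return sum;
    }

    public static void main(String[] args) {
        // (0+1+...+999) * (1+2+3) = 499500 * 6
        System.out.println(run()); // prints 2997000
    }
}
```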
>
> This leads to two interesting points wrt Lucene:
>
> 1) For benchmarks, warm-ups should not only be done to trigger speedups by
> the JIT, but also to put the JVM in a realistic production state. For the
> BKDPointTree, this means at least 3 implementations of the
> IntersectVisitor. I'm not sure if this is top of mind when writing Lucene
> benchmarks?
> 2) I tried changing DocIdsWriter::readInts32 (and readDelta16) to instead
> call the IntersectVisitor with a DocIdSetIterator, to reduce the number of
> virtual calls. In the benchmark setup by Elastic [2] I saw a decrease of
> execution time of 35-45% for range queries and numerical sorting with this
> patch applied. PR: https://github.com/apache/lucene/pull/13149
>
> I have not been able to reproduce the speedup with luceneutil - I suspect
> that the default tasks in it would not trigger this code path that much.
>
> If you want to understand more of my line of thinking, consider skimming
> through the blog post [1].
>
> [1] https://blunders.io/posts/es-benchmark-4-inlining
> [2] https://github.com/elastic/elasticsearch-opensearch-benchmark
>
> best regards,
> Anton H
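To make the bulk-visit idea concrete for other readers: here is a
much-simplified sketch of the shape of the change. These are hypothetical
interfaces, not Lucene's actual DocIdSetIterator/IntersectVisitor APIs - the
point is only that the per-document loop moves behind a single virtual call:

```java
// Simplified sketch (hypothetical names, not actual Lucene APIs) of moving
// from one virtual call per document to one virtual call per block of docs.
interface DocIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc(); // returns NO_MORE_DOCS when exhausted
}

interface BulkVisitor {
    void visit(int docId);               // old path: called once per doc
    default void visit(DocIterator it) { // new path: called once per block
        for (int d = it.nextDoc(); d != DocIterator.NO_MORE_DOCS; d = it.nextDoc()) {
            visit(d);
        }
    }
}

public class BulkVisitDemo {
    // Old shape: N virtual calls for a block of N docs.
    static void readIntsPerDoc(int[] docs, BulkVisitor v) {
        for (int doc : docs) v.visit(doc);
    }

    // New shape: one virtual call; the per-doc loop runs inside it.
    static void readIntsBulk(int[] docs, BulkVisitor v) {
        v.visit(new DocIterator() {
            int i = 0;
            public int nextDoc() {
                return i < docs.length ? docs[i++] : NO_MORE_DOCS;
            }
        });
    }

    public static void main(String[] args) {
        int[] docs = {3, 7, 11};
        long[] acc = {0};
        BulkVisitor v = docId -> acc[0] += docId;
        readIntsPerDoc(docs, v); // sums 21 via 3 virtual calls
        readIntsBulk(docs, v);   // sums another 21 via 1 virtual call
        System.out.println(acc[0]); // prints 42
    }
}
```

With the bulk shape, even a megamorphic dispatch is paid once per leaf block
rather than once per document, which (if I read the PR right) is where the
35-45% gain comes from.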