Hi Anton,

It took me a while to get through the blog post, and I suspect I will need
to read it a couple more times to understand it fully.
Thanks for writing up something so detailed. I learnt a lot, especially
about how the JVM inlines methods!
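For anyone else trying to build intuition for the inlining threshold the
post describes, here is a tiny self-contained sketch of the effect as I
understand it. The names are mine, not Lucene's, and whether inlining
actually happens depends on the JVM and flags like -XX:TypeProfileWidth:

```java
// Hypothetical illustration (not Lucene code): a call site that has seen
// at most two receiver types can usually be inlined by HotSpot, while a
// third implementation typically makes it a megamorphic virtual call.
interface Visitor {
    int visit(int docId);
}

public class MegamorphicSketch {
    // The JIT profiles this call site: with at most two Visitor
    // implementations observed, visitor.visit() can typically be inlined
    // (monomorphic/bimorphic). Once a third implementation shows up, the
    // site generally degrades to a true virtual dispatch.
    static int sumVisits(Visitor visitor, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += visitor.visit(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        Visitor a = docId -> docId;      // 1st implementation: monomorphic
        Visitor b = docId -> docId * 2;  // 2nd: bimorphic, still inlinable
        Visitor c = docId -> docId + 1;  // 3rd: site becomes megamorphic
        System.out.println(sumVisits(a, 10)); // 45
        System.out.println(sumVisits(b, 10)); // 90
        System.out.println(sumVisits(c, 10)); // 55
    }
}
```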

> I have not been able to reproduce the speedup with luceneutil - I suspect
that the default tasks in it would not trigger this code path that much.

Yeah, it seems like luceneutil is not stressing the code path that
Elasticsearch's benchmarks are?

> I tried changing the DocIdsWriter::readInts32 (and readDelta16), instead
calling the IntersectVisitor with a DocIdSetIterator, to reduce the number
of virtual calls. In the benchmark setup by Elastic [2] I saw a decrease of
execution time of 35-45% for range queries and numerical sorting with this
patch applied.

So it seems like switching from per-document visit(int docID) calls to
a single bulk visit(DocIdSetIterator iterator) call gave us these gains? Cool!
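To make sure I understand the shape of the change, here is a minimal
sketch. These are my own simplified stand-ins, not the actual Lucene
IntersectVisitor/DocIdSetIterator API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's DocIdSetIterator.
interface DocIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc();
}

// Hypothetical stand-in for IntersectVisitor.
interface BulkVisitor {
    void visit(int docId); // old path: one virtual call per doc ID

    // New path: a single virtual call hands over the whole block; the
    // loop inside runs against one concrete iterator type.
    default void visit(DocIterator it) {
        for (int doc = it.nextDoc(); doc != DocIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            visit(doc);
        }
    }
}

public class BulkVisitSketch {
    // Iterator over an int[] block, like a decoded leaf block of doc IDs.
    static DocIterator over(int[] docs) {
        return new DocIterator() {
            private int i = 0;
            public int nextDoc() {
                return i < docs.length ? docs[i++] : NO_MORE_DOCS;
            }
        };
    }

    static List<Integer> collectBulk(int[] block) {
        List<Integer> out = new ArrayList<>();
        BulkVisitor collector = out::add; // visit(int) adds to the list
        collector.visit(over(block));     // one virtual call for the block
        return out;
    }
}
```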

I am running Amazon Product Search's benchmarks to see if the change
moves the needle for us.

Small suggestion on the blog: the JVM inlining and the Elasticsearch
short-circuiting/optimization causing a difference in performance could
have been a blog post of its own, part 1 maybe. I got confused when the
post shifted from the performance differences between Elasticsearch and
OpenSearch to how you ended up improving Lucene.

Regards,
Gautam Worah.


On Fri, Mar 1, 2024 at 2:42 AM Anton Hägerstrand <an...@blunders.io> wrote:

> Hi everyone, long time lurker here.
>
> I recently investigated Elasticsearch/OpenSearch performance in a blog
> post [1], and saw some interesting behavior of numerical range queries and
> numerical sorting with regards to inlining and virtual calls.
>
> In short, the DocIdsWriter::readInts method seems to get much slower if it
> is called with 3 or more implementations of IntersectVisitor during the JVM
> lifetime. I believe that this is due to IntersectVisitor.visit(docid)
> being heavily inlined with 2 or fewer IntersectVisitor implementations,
> while becoming a virtual call with 3 or more.
>
> This leads to two interesting points wrt Lucene:
>
> 1) For benchmarks, warm-ups should not only trigger JIT speedups; they
> should also put the JVM in a realistic production state. For the
> BKDPointTree, this means at least 3 implementations of the
> IntersectVisitor. I'm not sure if this is top of mind when writing Lucene
> benchmarks?
> 2) I tried changing the DocIdsWriter::readInts32 (and readDelta16),
> instead calling the IntersectVisitor with a DocIdSetIterator, to reduce the
> number of virtual calls. In the benchmark setup by Elastic [2] I saw a
> decrease of execution time of 35-45% for range queries and numerical
> sorting with this patch applied. PR:
> https://github.com/apache/lucene/pull/13149
>
> I have not been able to reproduce the speedup with luceneutil - I suspect
> that the default tasks in it would not trigger this code path that much.
>
> If you want to understand more of my line of thinking, consider skimming
> through the blog post [1]
>
> [1] https://blunders.io/posts/es-benchmark-4-inlining
> [2] https://github.com/elastic/elasticsearch-opensearch-benchmark
>
> best regards,
> Anton H
>
