Re: Branchless binary search in Java?

2023-07-27 Thread Uwe Schindler
See Shipilevs blog: 
https://shipilev.net/jvm/anatomy-quarks/30-conditional-moves/

He also has some examples and also there's a command line option to tell 
hotspot when to use cmov: -XX:ConditionalMoveLimit

Uwe

Am 27. Juli 2023 14:43:19 MESZ schrieb Uwe Schindler :
>Hi Mike,
>
>actually Hotspot is using CMOV. Some nodes after bytecode analysis are 
>converted to CMoveNode and it has implementations to generate machine code (as 
>far as i see) for x86, s390 and arm.
>
>The generic code is here: 
>https://github.com/openjdk/jdk/blob/486c7844f902728ce580c3994f58e3e497834952/src/hotspot/share/opto/movenode.cpp
>
>Actually it is used in some cases, but I did not find out when it uses it to 
>generate instructions from bytecode. It also has some code to optimize the 
>cmov away if the result is known before (or not, e.g. for floats it does not 
>do this).
>
>I think best would be to ask on the hotspot compiler list on suggestions how 
>to write Java code to trigger the JVM to insert the CMOV.
>
>Uwe
>
>Am 27.07.2023 um 13:40 schrieb Michael McCandless:
>> Hi Team,
>> 
>> At Amazon (customer facing product search team) we've been playing with / 
>> benchmarking Tantivy (exciting Rust search engine loosely inspired by 
>> Lucene: https://github.com/quickwit-oss/tantivy, created by Paul Masurel and 
>> developed now by Quickwit and the Tantivy open source dev community) vs 
>> Lucene, by building a benchmark that blends Tantivy's search-benchmark-game 
>> (https://github.com/quickwit-oss/search-benchmark-game) and Lucene's nightly 
>> benchmarks (luceneutil: https://github.com/mikemccand/luceneutil and 
>> https://home.apache.org/~mikemccand/lucenebench).
>> 
>> It's great fun, and we would love more eyeballs to spot any remaining unfair 
>> aspects of the comparison (thank you Uwe and Adrien for catching things 
>> already!).  We are trying hard for an apples to apples comparison: same 
>> (enwiki) corpus, same queries, confirming we get precisely the same top N 
>> hits and total hit counts, latest versions of both engines.
>> 
>> Indeed, Tantivy is substantially (2-3X?) faster than Lucene for many 
>> queries, and we've been trying to understand why.
>> 
>> Sometimes it is due to a more efficient algorithms, e.g. the count() API for 
>> pure disjunctive queries, which Adrien is now porting to Lucene (thank 
>> you!), showing sizable (~80% faster in one query) speedups: 
>> https://github.com/apache/lucene/issues/12358. Other times may be due to 
>> Rust's more efficient/immediate/Python-like GC, or direct access to SIMD 
>> (Lucene now has this for aKNN search -- big speedup -- and maybe soon for 
>> postings too?), and unsafe code, different skip data / block postings 
>> encoding, or ...
>> 
>> Specifically, one of the fascinating Tantivy optimizations is the branchless 
>> binary search: https://quickwit.io/blog/search-a-sorted-block. Here's 
>> another blog post about it (implemented in C++): 
>> https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/
>> 
>> The idea is to get rustc/gcc to compile down to x86-64's CMOVcc 
>> ("conditional move") instruction (I'm not sure if ARM has an equivalent?  
>> Maybe "conditional execution"?).  The idea is to avoid a "true" branch of 
>> the instruction stream (costly pipeline flush on mis-prediction, which is 
>> likely in a binary search or priority queue context) by instead 
>> conditionally moving a value from one location to another (register or 
>> memory).  Tantivy uses this for skipping through postings, in a single layer 
>> in-memory skip list structure (versus Lucene's on-disk (memory-mapped, by 
>> default) multi-layer skip list) -- see the above blog post.
>> 
>> This made me wonder: does javac's hotspot compiler use CMOVcc?  I see javac 
>> bug fixes like https://github.com/openjdk/mobile/commit/a03e9220 which seems 
>> to imply C2 does in fact compile to CMOVcc sometimes.  So then I wondered 
>> whether a branchless binary search in Java is a possibility?  Has anyone 
>> played with this?
>> 
>> Before Robert gets too upset :)  Even if we could build such a thing, the 
>> right place for such optimizations is likely the JDK itself (see the similar 
>> discussion about SIMD-optimized sorting: 
>> https://github.com/apache/lucene/issues/12399). Still, I'm curious if anyone 
>> has explored this, and maybe saw some performance gains from way up in 
>> javaland where we can barely see the bare metal shining beneath us :)
>> 
>> Sorry for the long post!
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>
>-- 
>Uwe Schindler
>Achterdiek 19, D-28357 Bremen
>https://www.thetaphi.de
>eMail:u...@thetaphi.de

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Branchless binary search in Java?

2023-07-27 Thread Uwe Schindler

Hi Mike,

actually Hotspot is using CMOV. Some nodes after bytecode analysis are 
converted to CMoveNode and it has implementations to generate machine 
code (as far as i see) for x86, s390 and arm.


The generic code is here: 
https://github.com/openjdk/jdk/blob/486c7844f902728ce580c3994f58e3e497834952/src/hotspot/share/opto/movenode.cpp


Actually it is used in some cases, but I did not find out when it uses 
it to generate instructions from bytecode. It also has some code to 
optimize the cmov away if the result is known before (or not, e.g. for 
floats it does not do this).


I think best would be to ask on the hotspot compiler list on suggestions 
how to write Java code to trigger the JVM to insert the CMOV.


Uwe

Am 27.07.2023 um 13:40 schrieb Michael McCandless:

Hi Team,

At Amazon (customer facing product search team) we've been playing 
with / benchmarking Tantivy (exciting Rust search engine loosely 
inspired by Lucene: https://github.com/quickwit-oss/tantivy, created 
by Paul Masurel and developed now by Quickwit and the Tantivy open 
source dev community) vs Lucene, by building a benchmark that blends 
Tantivy's search-benchmark-game 
(https://github.com/quickwit-oss/search-benchmark-game) and Lucene's 
nightly benchmarks (luceneutil: 
https://github.com/mikemccand/luceneutil and 
https://home.apache.org/~mikemccand/lucenebench).


It's great fun, and we would love more eyeballs to spot any remaining 
unfair aspects of the comparison (thank you Uwe and Adrien for 
catching things already!).  We are trying hard for an apples to apples 
comparison: same (enwiki) corpus, same queries, confirming we get 
precisely the same top N hits and total hit counts, latest versions of 
both engines.


Indeed, Tantivy is substantially (2-3X?) faster than Lucene for many 
queries, and we've been trying to understand why.


Sometimes it is due to a more efficient algorithms, e.g. the count() 
API for pure disjunctive queries, which Adrien is now porting to 
Lucene (thank you!), showing sizable (~80% faster in one query) 
speedups: https://github.com/apache/lucene/issues/12358. Other times 
may be due to Rust's more efficient/immediate/Python-like GC, or 
direct access to SIMD (Lucene now has this for aKNN search -- big 
speedup -- and maybe soon for postings too?), and unsafe code, 
different skip data / block postings encoding, or ...


Specifically, one of the fascinating Tantivy optimizations is the 
branchless binary search: 
https://quickwit.io/blog/search-a-sorted-block. Here's another blog 
post about it (implemented in C++): 
https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/


The idea is to get rustc/gcc to compile down to x86-64's CMOVcc 
("conditional move") instruction (I'm not sure if ARM has an 
equivalent?  Maybe "conditional execution"?).  The idea is to avoid a 
"true" branch of the instruction stream (costly pipeline flush on 
mis-prediction, which is likely in a binary search or priority queue 
context) by instead conditionally moving a value from one location to 
another (register or memory).  Tantivy uses this for skipping through 
postings, in a single layer in-memory skip list structure (versus 
Lucene's on-disk (memory-mapped, by default) multi-layer skip list) -- 
see the above blog post.


This made me wonder: does javac's hotspot compiler use CMOVcc?  I see 
javac bug fixes like https://github.com/openjdk/mobile/commit/a03e9220 
which seems to imply C2 does in fact compile to CMOVcc sometimes.  So 
then I wondered whether a branchless binary search in Java is a 
possibility?  Has anyone played with this?


Before Robert gets too upset :)  Even if we could build such a thing, 
the right place for such optimizations is likely the JDK itself (see 
the similar discussion about SIMD-optimized sorting: 
https://github.com/apache/lucene/issues/12399). Still, I'm curious if 
anyone has explored this, and maybe saw some performance gains from 
way up in javaland where we can barely see the bare metal shining 
beneath us :)


Sorry for the long post!

Mike McCandless

http://blog.mikemccandless.com


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Branchless binary search in Java?

2023-07-27 Thread Michael McCandless
Hi Team,

At Amazon (customer facing product search team) we've been playing with /
benchmarking Tantivy (exciting Rust search engine loosely inspired by
Lucene: https://github.com/quickwit-oss/tantivy, created by Paul Masurel
and developed now by Quickwit and the Tantivy open source dev community) vs
Lucene, by building a benchmark that blends Tantivy's search-benchmark-game
(https://github.com/quickwit-oss/search-benchmark-game) and Lucene's
nightly benchmarks (luceneutil: https://github.com/mikemccand/luceneutil
and https://home.apache.org/~mikemccand/lucenebench).

It's great fun, and we would love more eyeballs to spot any remaining
unfair aspects of the comparison (thank you Uwe and Adrien for catching
things already!).  We are trying hard for an apples to apples comparison:
same (enwiki) corpus, same queries, confirming we get precisely the same
top N hits and total hit counts, latest versions of both engines.

Indeed, Tantivy is substantially (2-3X?) faster than Lucene for many
queries, and we've been trying to understand why.

Sometimes it is due to a more efficient algorithms, e.g. the count() API
for pure disjunctive queries, which Adrien is now porting to Lucene (thank
you!), showing sizable (~80% faster in one query) speedups:
https://github.com/apache/lucene/issues/12358.  Other times may be due to
Rust's more efficient/immediate/Python-like GC, or direct access to SIMD
(Lucene now has this for aKNN search -- big speedup -- and maybe soon for
postings too?), and unsafe code, different skip data / block postings
encoding, or ...

Specifically, one of the fascinating Tantivy optimizations is the
branchless binary search: https://quickwit.io/blog/search-a-sorted-block.
Here's another blog post about it (implemented in C++):
https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/

The idea is to get rustc/gcc to compile down to x86-64's CMOVcc
("conditional move") instruction (I'm not sure if ARM has an equivalent?
Maybe "conditional execution"?).  The idea is to avoid a "true" branch of
the instruction stream (costly pipeline flush on mis-prediction, which is
likely in a binary search or priority queue context) by instead
conditionally moving a value from one location to another (register or
memory).  Tantivy uses this for skipping through postings, in a single
layer in-memory skip list structure (versus Lucene's on-disk
(memory-mapped, by default) multi-layer skip list) -- see the above blog
post.

This made me wonder: does javac's hotspot compiler use CMOVcc?  I see javac
bug fixes like https://github.com/openjdk/mobile/commit/a03e9220 which
seems to imply C2 does in fact compile to CMOVcc sometimes.  So then I
wondered whether a branchless binary search in Java is a possibility?  Has
anyone played with this?

Before Robert gets too upset :)  Even if we could build such a thing, the
right place for such optimizations is likely the JDK itself (see the
similar discussion about SIMD-optimized sorting:
https://github.com/apache/lucene/issues/12399).  Still, I'm curious if
anyone has explored this, and maybe saw some performance gains from way up
in javaland where we can barely see the bare metal shining beneath us :)

Sorry for the long post!

Mike McCandless

http://blog.mikemccandless.com