RFR: JDK-8180450: secondary_super_cache does not scale well

Andrew Haley Thu, 14 Mar 2024 12:03:14 -0700

This PR is a redesign of subtype checking. 

The implementation of subtype checking in the HotSpot JVM is now twenty years 
old. There have been some performance-related bugs reported, and the only way 
to fix them is a redesign of the way it works.


So what's changed, so that the old design should be replaced?

Firstly, the computers of today aren't the computers of twenty years ago. It's 
not merely a matter of speed: the systems are much more parallel, both in the 
sense of having more cores and each core can run many instructions in parallel. 
Because of this, the speed ratio between memory accesses and the rate at which 
we can execute instructions has become wider and wider.

The most severe reported problem is to do with the "secondary supers cache". 
This is a 1-element per-class cache for interfaces (and arrays of interfaces). 
Unfortunately, if two threads repeatedly update this cache, the result is that 
a cache line ping-pongs between cores, causing a severe slowdown. 

Also, the linear search for an interface that is absent means that the entire 
list of interfaces has to be scanned. This plays badly with newer language 
features such as JEP 406, pattern matching for switch.

However, the computers of today can help us. The very high 
instruction-per-cycle rate of a Great Big Out-Of-Order (GBOOO) processor allows 
us to execute many of the instructions of a hash table lookup in parallel, as 
long as we avoid dependencies between instructions. 

The solution
------------

We use a hashed lookup of secondary supers. This is a 64-way hash table, with 
linear probing for collisions. The table is compressed, in that null entries 
are removed, and the resulting hash table fits into the same secondary supers 
array as today's unsorted array of secondary supers. This means that existing 
code in HotSpot that simply does a linear scan of the secondary supers array 
does not need to be altered.

We add a bitmap field to each Klass object. This bitmap contains an occupancy 
bit corresponding to each element of the hash table, with a 1 indicating 
element presence. As well as allowing the hash table to be decompressed, this 
bimap is used as a simple kind of Bloom Filter. To determine whether a 
superclass is present, we simply have to check a single bit in the bitmap. If 
the bit is clear, we know that the superclass is not present. If the bit is 
set, we have to do a little arithmetic and then consult the hash table.

It works like this:


       mov    sub_klass, [& sub_klass->_bitmap]
       mov    array_index, sub_klass
       shl    array_index, <hash code of class we're looking for>
  ╭    jns    not_found 
  │    popcnt array_index,array_index
  │    mov    array_base, [& sub_klass->_secondary_supers]
  │    cmp    super_klass, [array_base+array_index*8]
  │    je     found


The popcount instruction returns the cardinality of a bitset. By shifting out 
the bits higher in number than the hash code of the element we're looking for, 
we leave only the lower bits. `popcount`, then, gives us the index of the 
element we're looking for.

If we don't get a match at the first attempt, we test the next bit in the 
bitset and jump to a fallback stub:


  │    bt     sub_klass, <next hash code>
  │╭   jae    not_found
  ││   ror    sub_klass, <hash code>
  ││   call   Stub::klass_subtype_fallback
  ││   je     found
  ↘↘


Collisions are rare. Vladimir Ivanov did a survey of Java benchmark code, and 
it is very rare to see more than about 20 super-interfaces, and even that gives 
us a collision rate of only about 0.25.

The time taken for a positive lookup is somewhere between 3 - 6 cycles, or 
about 0.9 - 1.8 ns. This is a robust figure, confirmed across current AArch64 
and x86 designs, and this rate can be sustained indefinitely. Negative lookups 
are slightly faster because there's usually no need to consult the secondary 
super cache, at about 3 - 5 cycles.

The current secondary super cache lookup is usually slightly faster than a hash 
table for positive lookups, at about 3 - 4 cycles, but it performs badly with 
negative lookups, and unless the class you're looking for is in the secondary 
super cache it performs badly as well. For example, a negative lookup in a 
class with 4 interfaces takes 10 - 47 cycles.

Limitations and disadvantages
-----------------------------

This PR is a subset of what is possible. It is only implemented for C2, in the 
cases where the superclass is a constant, known at compile time. Given that 
this is almost all uses of subtype checking, it's not really a restriction.

I would have liked to save the hashed interfaces list in the CDS archive file, 
but there are test cases for the Serviceability Agent that assume the 
interfaces are in the order in which they are declared. I think the tests are 
probably just wrong about this. For the same reason, I have to make copies of 
the interfaces array and sort the copies, rather than just using the same array 
at runtime.

I haven't removed the secondary super cache. It's still used by the interpreter 
and C2. 

In the future I'd like to delete the secondary super cache, but it is used in 
many places across the VM. Today is not that day.

Performance testing
-------------------

Hashing the secondary supers arrays takes a little time. I've added a perf 
counter for this, so you can see. It's only really a few milliseconds added to 
startup.

I have run Renaissance and SPECjvm, and whatever speed differences there may be 
are immeasurably small, down in the noise. The benchmark in this class allows 
you to isolate single instanceof invocations with various sets of secondary 
superclasses.

Finally
-------

Vladimir Ivanov was very generous with his time and his advice. He explained 
the problem and went into detail about his experiments, and shared with me his 
experimental code. This saved me a great deal of time in this work.

-------------

Commit messages:
 - Copyright dates
 - Merge branch 'JDK-8180450' of https://github.com/theRealAph/jdk into 
JDK-8180450
 - JDK-8180450: secondary_super_cache does not scale well
 - JDK-8180450: secondary_super_cache does not scale well
 - JDK-8180450: secondary_super_cache does not scale well
 - Merge branch 'JDK-8180450' of https://github.com/theRealAph/jdk into 
JDK-8180450
 - JDK-8180450: secondary_super_cache does not scale well
 - Comment typo
 - JDK-8180450: secondary_super_cache does not scale well
 - JDK-8180450: secondary_super_cache does not scale well
 - ... and 38 more: https://git.openjdk.org/jdk/compare/2482a505...a0677b71

Changes: https://git.openjdk.org/jdk/pull/18309/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18309&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8180450
  Stats: 1314 lines in 29 files changed: 1288 ins; 0 del; 26 mod
  Patch: https://git.openjdk.org/jdk/pull/18309.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18309/head:pull/18309

PR: https://git.openjdk.org/jdk/pull/18309

RFR: JDK-8180450: secondary_super_cache does not scale well

Reply via email to