On Thu, 25 May 2023 23:54:14 GMT, Andrei Pangin <[email protected]> wrote:
>> UUID is a very important class that is used to track identities of objects
>> in large-scale systems. On some of our systems, `UUID.randomUUID` takes >1%
>> of total CPU time, and is frequently a scalability bottleneck due to
>> `SecureRandom` synchronization.
>>
>> The major issue with the UUID code itself is that it reads from a single
>> `SecureRandom` instance 16 bytes at a time, so the heavily contended
>> `SecureRandom` is hammered with very small requests. This also has a chilling
>> effect on other users of `SecureRandom` when there is heavy UUID generation
>> traffic.
>>
>> We can improve this by doing bulk reads from the backing `SecureRandom`
>> and possibly striping the reads across many instances of it.
>>
>>
>> Benchmark                Mode  Cnt  Score   Error  Units
>>
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>>
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max     thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>>
>> # After
>> UUIDRandomBench.single  thrpt   15  4.421 ± 0.047  ops/us
>> UUIDRandomBench.max     thrpt   15  6.658 ± 0.092  ops/us ; positive scaling, ~1.5x
>>
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>>
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max     thrpt   15  1.880 ± 0.029  ops/us ; negative scaling
>>
>> # After
>> UUIDRandomBench.single  thrpt   15  3.099 ± 0.022  ops/us
>> UUIDRandomBench.max     thrpt   15  3.555 ± 0.062  ops/us ; positive scaling, ~1.2x
>>
>>
>> Note that there is still a scalability bottleneck in the current default
>> random (`NativePRNG`), because it synchronizes over a singleton instance for
>> the SHA1 mixer, then the engine itself, etc. -- it is quite a whack-a-mole to
>> figure out the synchronization story there. A scalability fix in the current
>> default `SecureRandom` would be much more intrusive and risky, since it would
>> change a core crypto class with unknown bug fanout.
>>
>> Using bulk reads is still a win even when the underlying PRNG is heavily
>> synchronized. A more scalable PRNG would benefit from this as well. This
>> PR adds a system property to select the PRNG implementation, and there we
>> can clearly see the benefit with more scalable PRNG sources:
>>
>>
>> Benchmark Mode Cnt Score Error Units
>>
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>>
>> # Before, hacked `new SecureRandom()` to `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single thrpt ...
>
> src/java.base/share/classes/java/util/UUID.java line 286:
>
>> 284: long lsb = 0;
>> 285: for (int i = start; i < start + 8; i++) {
>> 286: msb = (msb << 8) | (data[i] & 0xff);
>
> Can we use VarHandle/ByteBuffer to read 64 bits at a time?
`jdk.internal.util.ByteArray` has VarHandle-based methods ready for that.
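
To illustrate the idea with public API only (`jdk.internal.util.ByteArray` is internal to `java.base`), here is a minimal sketch built on `MethodHandles.byteArrayViewVarHandle`. The class and method names (`UuidBytes`, `readLong`, `readLongLoop`) are hypothetical and this is not the code in the PR. It only shows that one big-endian 64-bit read yields the same value as the byte-at-a-time shift loop quoted above.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical helper class, for illustration only.
final class UuidBytes {
    // Views a byte[] as big-endian longs. The int coordinate passed to
    // get() is a byte offset into the array, not a long index.
    private static final VarHandle LONG_BE =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

    private UuidBytes() {}

    // Reads 8 bytes starting at 'start' as one big-endian long.
    // Plain get() does not require the offset to be 8-byte aligned.
    static long readLong(byte[] data, int start) {
        return (long) LONG_BE.get(data, start);
    }

    // The byte-at-a-time equivalent, mirroring the loop under review.
    static long readLongLoop(byte[] data, int start) {
        long value = 0;
        for (int i = start; i < start + 8; i++) {
            value = (value << 8) | (data[i] & 0xff);
        }
        return value;
    }
}
```

With a helper like this, the two halves of a UUID taken from a larger buffer would be `readLong(data, start)` and `readLong(data, start + 8)`. Big-endian order is needed to match the shift loop, where the first byte ends up most significant.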
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206305263