On Mon, 6 Jan 2025 13:18:50 GMT, Shaojin Wen <[email protected]> wrote:
> Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD
> within a register) instead of table lookup. Eliminating the table lookup can
> also avoid the performance degradation problem when the cache misses.
By stepping through the code of `Long.expand`, and substituting in the
constants, I come up with this:
    static long expandNibbles(long i) {
        // Inlined version of Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL)
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);
        return i & 0x0F0F_0F0F_0F0F_0F0FL;
    }
This looks like it might actually do better than *Method 2*. If inlining and
constant folding are happening in the non-intrinsic `Long.expand`, I would
imagine it performs comparably to this.
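One way to sanity-check the hand-inlined version is to compare it against
`Long.expand` on random inputs. The following is only a sketch, assuming
JDK 19 or later (where `Long.expand` is available); the class name
`ExpandNibblesCheck`, the helper `matchesLongExpand`, and the seed and sample
count are illustrative and not part of the PR.

```java
import java.util.SplittableRandom;

public class ExpandNibblesCheck {
    static final long MASK = 0x0F0F_0F0F_0F0F_0F0FL;

    // Hand-inlined Long.expand(i, MASK): spreads the low 32 bits of i so that
    // each 4-bit group lands in the low nibble of its own byte.
    public static long expandNibbles(long i) {
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);
        return i & MASK;
    }

    // Hypothetical helper: returns true if the inlined version agrees with
    // Long.expand for `samples` pseudo-random inputs.
    public static boolean matchesLongExpand(long seed, int samples) {
        SplittableRandom rnd = new SplittableRandom(seed);
        for (int n = 0; n < samples; n++) {
            long v = rnd.nextLong();
            if (expandNibbles(v) != Long.expand(v, MASK)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesLongExpand(42L, 1_000_000));
    }
}
```

This only checks functional equivalence; it says nothing about whether C2
actually folds the constants in the non-intrinsic path.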
The non-intrinsified Java code should be able to run as quickly as the
hand-inlined one.
I think I've found an issue that prevents the code from being constant-folded
as expected. C2 seems not to constant-fold XOR nodes.
See https://github.com/openjdk/jdk/pull/23089 for an attempt at addressing this.
There are no XOR nodes in `expandNibbles`.

-------------
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2584577398
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2588342173
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2590840422