On Mon, 2 Sep 2024 09:32:56 GMT, Maurizio Cimadamore <[email protected]>
wrote:
>> If I run:
>>
>>
>> @Benchmark
>> public long shift() {
>>     return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40 |
>>            ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16 |
>>            ELEM_SIZE << 8 | ELEM_SIZE;
>> }
>>
>> @Benchmark
>> public long mul() {
>>     return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
>> }
>>
>> Then I get:
>>
>> Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
>> TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
>> TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op
>>
>> On my M1 machine.
>
> I found similar small improvements to be had (I wrote about them offline)
> when replacing the bitwise-based tests (e.g. `(foo & 4) != 0`) with a more
> explicit check for `remainingBytes >= 4`. Seems like bitwise operations are
> not as optimized (or perhaps the assembly instructions for them are overall
> more convoluted - I haven't checked).

I've tried:

    final long longValue = Byte.toUnsignedLong(value) * 0x0101010101010101L;

but it had the same performance as explicit bit shifting on M1.
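
For reference, here is a minimal sketch of why the two forms are interchangeable as far as the result goes: multiplying the unsigned byte by `0x0101010101010101L` places one copy of it in each of the eight byte lanes, which is what the shift/or chain does explicitly. The class and method names (`ByteBroadcast`, `broadcastMul`, `broadcastShift`) are hypothetical and not taken from the PR, and the sketch only checks equivalence; it says nothing about relative speed.

```java
public class ByteBroadcast {

    // Multiplication form: each 0x01 byte in the constant contributes one copy
    // of the (unsigned) byte, and the copies never overlap, so there are no carries.
    static long broadcastMul(byte value) {
        return Byte.toUnsignedLong(value) * 0x0101010101010101L;
    }

    // Explicit form: shift the unsigned byte into each of the eight lanes and OR them.
    static long broadcastShift(byte value) {
        long b = Byte.toUnsignedLong(value);
        return b << 56 | b << 48 | b << 40 | b << 32
             | b << 24 | b << 16 | b << 8 | b;
    }

    public static void main(String[] args) {
        // Exhaustively compare both forms over all 256 byte values.
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte v = (byte) i;
            if (broadcastMul(v) != broadcastShift(v)) {
                throw new AssertionError("mismatch for byte value " + i);
            }
        }
        // prints 1f1f1f1f1f1f1f1f
        System.out.println(Long.toHexString(broadcastMul((byte) 0x1F)));
    }
}
```

Since both forms compute the same value, any difference between them in a benchmark comes down to the code the JIT emits on a given platform.
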
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1741664877