On Wed, 17 Jun 2026 18:58:05 GMT, Ferenc Rakoczi <[email protected]> wrote:

>> @ferakocz you could use the `extr` instruction to do what you want here i.e.
>> 
>>       // combine 40 + 12 bits into hi result
>>       __ lsl(hi, hi, montMulP256Shift1);
>>       __ lsr(tmp, lo, montMulP256Shift2);
>>       __ orr(hi, hi, tmp);
>>       // mask off 52 bits of lo result
>>       __ andr(lo, lo, mask);
>> 
>> can be replaced with
>> 
>>       // combine 40 + 12 bits into hi result
>>       __ extr(hi, hi, lo, montMulP256Shift2);
>>       // mask off 52 bits of lo result
>>       __ andr(lo, lo, mask);
>> 
>> 
>> That has the advantage of not requiring you to use `tmp`.
>
> @adinn : Well, I have changed it, but that caused a small but consistent 
> performance drop (~0.7%) in the benchmark on the M1 chip that I measured it 
> on. Since this PR is about performance, I am inclined to revert this change, 
> if you agree.

Hmm, interesting result. Yes, please stick with the current version. But also 
update the comment to note that this `lsl`; `lsr`; `orr` sequence outperforms 
the single `extr` instruction (by ~0.7% in your M1 measurement).
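
For anyone reading along, here is a rough C++ model of why the two sequences 
compute the same value. It is only a sketch: the shift values 12 and 52 are 
an assumption inferred from the "40 + 12 bits" and "52 bits" comments in the 
quoted code, and the names `kShift1`, `kShift2`, `kMask52`, 
`combine_shift_or` and `extr64` are made up for illustration, not taken 
from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values, inferred from the "40 + 12 bits" / "52 bits" comments.
// The real montMulP256Shift1 / montMulP256Shift2 constants may differ.
constexpr unsigned kShift1 = 12;
constexpr unsigned kShift2 = 52;
constexpr uint64_t kMask52 = (uint64_t{1} << 52) - 1;

// Models the three-instruction form: lsl hi; lsr tmp, lo; orr hi, hi, tmp.
uint64_t combine_shift_or(uint64_t hi, uint64_t lo) {
  uint64_t tmp = lo >> kShift2;   // top 12 bits of lo
  return (hi << kShift1) | tmp;   // make room for them at the bottom of hi
}

// Models AArch64 EXTR Xd, Xn, Xm, #lsb: the 64 bits starting at bit lsb
// of the 128-bit concatenation Xn:Xm. Valid here for 0 <= lsb < 64.
uint64_t extr64(uint64_t xn, uint64_t xm, unsigned lsb) {
  if (lsb == 0) {
    return xm;                    // avoid an undefined 64-bit shift in C++
  }
  return (xn << (64 - lsb)) | (xm >> lsb);
}
```

The two forms agree only because the assumed shifts add up to 64 
(12 + 52), so `extr64(hi, lo, kShift2)` and `combine_shift_or(hi, lo)` 
return the same result for any inputs. The model says nothing about the 
timing difference that was measured.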

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3431070038
