On Mon, 15 Jun 2026 14:10:14 GMT, Andrew Dinn <[email protected]> wrote:

>> Added the comments. As for `bfm`, it is one instruction fewer, but to me 
>> it is not as intuitive as the shift and or.
>
> @ferakocz you could use the `extr` instruction to do what you want here i.e.
> 
>       // combine 40 + 12 bits into hi result
>       __ lsl(hi, hi, montMulP256Shift1);
>       __ lsr(tmp, lo, montMulP256Shift2);
>       __ orr(hi, hi, tmp);
>       // mask off 52 bits of lo result
>       __ andr(lo, lo, mask);
> 
> can be replaced with
> 
>       // combine 40 + 12 bits into hi result
>       __ extr(hi, hi, lo, montMulP256Shift2);
>       // mask off 52 bits of lo result
>       __ andr(lo, lo, mask);
> 
> 
> That has the advantage of not requiring you to use `tmp`.

@adinn: I made the change, but it caused a small but consistent performance
drop (~0.7%) in the benchmark on the M1 chip I measured on. Since this PR is
about performance, I am inclined to revert the change, if you agree.
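
For reference, here is a minimal sketch in plain C++ of why the two sequences should compute the same value. It assumes `montMulP256Shift1` is 12 and `montMulP256Shift2` is 52; that is inferred from the "40 + 12 bits" comment and is not stated in this thread. The helper names `combine_shift_or` and `extr64` are hypothetical, and the sketch only models the arithmetic, so it says nothing about the timing difference.

```cpp
#include <cstdint>

// Models the three-instruction form: lsl hi, lsr tmp, orr hi.
// The two forms only agree when shift1 + shift2 == 64.
constexpr uint64_t combine_shift_or(uint64_t hi, uint64_t lo,
                                    unsigned shift1, unsigned shift2) {
    return (hi << shift1) | (lo >> shift2);
}

// Models AArch64 `extr Xd, Xn, Xm, #lsb`: the 64 bits starting at bit `lsb`
// of the 128-bit concatenation Xn:Xm. lsb == 0 is special-cased because a
// 64-bit shift by 64 is undefined in C++.
constexpr uint64_t extr64(uint64_t xn, uint64_t xm, unsigned lsb) {
    return lsb == 0 ? xm : ((xn << (64 - lsb)) | (xm >> lsb));
}

// Assumed values, see the note above.
constexpr unsigned kShift1 = 12;
constexpr unsigned kShift2 = 52;

// Sample inputs: a 40-bit hi limb and an arbitrary 64-bit lo limb.
constexpr uint64_t kHi = 0x000000ABCDEF0123ULL;
constexpr uint64_t kLo = 0xFEDCBA9876543210ULL;

static_assert(combine_shift_or(kHi, kLo, kShift1, kShift2) ==
              extr64(kHi, kLo, kShift2),
              "shift/or and extr forms should agree");
```

With these inputs both forms give `0x000ABCDEF0123FED`: the 40 bits of `hi` shifted up by 12, with the top 12 bits of `lo` filling the low end.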

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3430734670
