On Wed, 17 Jun 2026 18:58:05 GMT, Ferenc Rakoczi <[email protected]> wrote:
>> @ferakocz you could use the `extr` instruction to do what you want here i.e.
>> 
>>     // combine 40 + 12 bits into hi result
>>     __ lsl(hi, hi, montMulP256Shift1);
>>     __ lsr(tmp, lo, montMulP256Shift2);
>>     __ orr(hi, hi, tmp);
>>     // mask off 52 bits of lo result
>>     __ andr(lo, lo, mask);
>> 
>> can be replaced with
>> 
>>     // combine 40 + 12 bits into hi result
>>     __ extr(hi, hi, low, montMulP256Shift2);
>>     // mask off 52 bits of lo result
>>     __ andr(lo, lo, mask);
>> 
>> 
>> That has the advantage of not requiring you to use `tmp`.
> 
> @adinn : Well, I have changed it, but that caused a small, but consistent
> performance drop (~0.7%) in the benchmark on the M1 chip that I measured it
> on. Since this PR is for performance, I am inclined to revert this change, if
> you agree.

Hmm, interesting result. Yes, please stick with the current version. But also update the comment to note that this lsl;lsr;orr sequence out-performs extract.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3431070038
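
For reference, a minimal C++ sketch of why the two sequences should compute the same value. It models the registers as plain 64-bit integers and assumes the shift constants are 12 and 52, which is inferred from the "40 + 12 bits" and "52 bits" comments rather than taken from the patch. The names kShift1, kShift2, combine_lsl_lsr_orr, combine_extr and check_same are made up for this illustration. It says nothing about the timing difference measured on the M1.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values standing in for montMulP256Shift1 / montMulP256Shift2;
// the real constants are not shown in this thread.
constexpr unsigned kShift1 = 12;
constexpr unsigned kShift2 = 52;
static_assert(kShift1 + kShift2 == 64,
              "the extr form only matches when the shifts sum to 64");

// Models: lsl(hi, hi, Shift1); lsr(tmp, lo, Shift2); orr(hi, hi, tmp);
constexpr std::uint64_t combine_lsl_lsr_orr(std::uint64_t hi, std::uint64_t lo) {
    return (hi << kShift1) | (lo >> kShift2);
}

// Models: extr(hi, hi, lo, Shift2), i.e. take 64 bits of the 128-bit
// concatenation hi:lo starting at bit position Shift2.
constexpr std::uint64_t combine_extr(std::uint64_t hi, std::uint64_t lo) {
    return (hi << (64 - kShift2)) | (lo >> kShift2);
}

// Asserts that both models agree for one pair of inputs.
inline void check_same(std::uint64_t hi, std::uint64_t lo) {
    assert(combine_lsl_lsr_orr(hi, lo) == combine_extr(hi, lo));
}
```

Under these assumptions the two forms are bit-for-bit identical, so the choice between them comes down to register pressure (no `tmp`) versus the measured throughput.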
