On Mon, 15 Jun 2026 14:10:14 GMT, Andrew Dinn <[email protected]> wrote:

>> Added the comments, but as for clarity of bfm, it is one less instruction,
>> but to me it is not as intuitive as the shift and or.
>
> @ferakocz you could use the `extr` instruction to do what you want here i.e.
>
>     // combine 40 + 12 bits into hi result
>     __ lsl(hi, hi, montMulP256Shift1);
>     __ lsr(tmp, lo, montMulP256Shift2);
>     __ orr(hi, hi, tmp);
>     // mask off 52 bits of lo result
>     __ andr(lo, lo, mask);
>
> can be replaced with
>
>     // combine 40 + 12 bits into hi result
>     __ extr(hi, hi, low, montMulP256Shift2);
>     // mask off 52 bits of lo result
>     __ andr(lo, lo, mask);
>
> That has the advantage of not requiring you to use `tmp`.

@adinn : Well, I have changed it, but that caused a small, but consistent performance drop (~0.7%) in the benchmark on the M1 chip that I measured it on. Since this PR is for performance, I am inclined to revert this change, if you agree.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3430734670
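
For reference, a minimal C++ sketch of why the two sequences quoted above produce the same `hi` value. It models the instructions on plain 64-bit integers and is not the stub code itself. The helper names (`combine_shift_or`, `extr64`) and the constants `kShift1 = 12` and `kShift2 = 52` are assumptions made for illustration, inferred from the "40 + 12 bits" comment; the actual values of `montMulP256Shift1` and `montMulP256Shift2` are not shown in this thread. The equivalence only holds when the two shift amounts sum to 64.

```cpp
#include <cstdint>

// Model of the three-instruction sequence:
//   lsl hi, hi, #shift1 ; lsr tmp, lo, #shift2 ; orr hi, hi, tmp
constexpr uint64_t combine_shift_or(uint64_t hi, uint64_t lo,
                                    unsigned shift1, unsigned shift2) {
  return (hi << shift1) | (lo >> shift2);
}

// Model of AArch64 `extr Xd, Xn, Xm, #lsb` for 0 < lsb < 64:
// the result is the 64 bits starting at bit `lsb` of the 128-bit
// concatenation Xn:Xm (Xn supplies the upper half).
constexpr uint64_t extr64(uint64_t xn, uint64_t xm, unsigned lsb) {
  return (xn << (64 - lsb)) | (xm >> lsb);
}

// Assumed shift amounts (hypothetical, inferred from "40 + 12 bits").
constexpr unsigned kShift1 = 12;
constexpr unsigned kShift2 = 52;
static_assert(kShift1 + kShift2 == 64,
              "extr replaces lsl/lsr/orr only if the shifts sum to 64");

// Compile-time spot check that both forms agree on one sample input.
static_assert(combine_shift_or(0x123456789AULL, 0xABCDEF0123456789ULL,
                               kShift1, kShift2) ==
              extr64(0x123456789AULL, 0xABCDEF0123456789ULL, kShift2),
              "shift/or and extr models disagree");
```

This only models the computed result. It says nothing about instruction latency or throughput, so it does not bear on the ~0.7% difference measured on the M1.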
