On Sun, 13 Mar 2022 04:27:44 GMT, Jatin Bhateja <[email protected]> wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4178:
>>
>>> 4176:   movl(scratch, 1056964608);
>>> 4177:   movq(xtmp1, scratch);
>>> 4178:   vbroadcastss(xtmp1, xtmp1, vec_enc);
>>
>> You could put the constant in the constant table and use `vbroadcastss` here
>> also.
>>
>> Thank you very much.
>
> constant and register to register moves are never issued to execution ports,
> rematerializing value rather than reading from memory will give better
> performance.

I have looked into this a little. While `movl r, i` may not consume an
execution port, `movq x, r` and `vbroadcastss x, x` surely do. The sequence
therefore costs 3 retired and 2 executed uops. Furthermore, both `movq x, r`
and `vbroadcastss x, x` can only run on port 5, which limits the throughput of
the operation.

In contrast, a `vbroadcastss x, m` results in only 1 retired and 1 executed
uop, reducing pressure on both the decoder and the backend. It can also run on
either port 2 or port 3, which offers much better throughput. Latency is not
much of a concern here, since the operation has no input dependency.

> register to register moves are never issued to execution ports

I believe you misremembered this part: a register-to-register move is only
elided when both registers are of the same kind. A `vmovq x, r` would result in
1 uop being executed on port 5.

What do you think?

Thank you very much.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7094
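
For reference, the uop accounting in the reply above can be tallied in a small C++ sketch. The `UopCost` type and the function names are hypothetical bookkeeping helpers, not HotSpot code, and the per-instruction figures are the ones stated in the message rather than measurements. The sketch also decodes the immediate from the quoted snippet, since that is the value a constant-table entry would have to hold.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical bookkeeping type (not part of HotSpot): uop counts for one
// instruction, using the figures stated in the message, not measured ones.
struct UopCost {
    int retired;   // uops that retire
    int executed;  // uops issued to an execution port
};

constexpr UopCost operator+(UopCost a, UopCost b) {
    return UopCost{a.retired + b.retired, a.executed + b.executed};
}

constexpr UopCost kMovlImm      = {1, 0};  // movl r, i: assumed to need no port
constexpr UopCost kMovqXmmGpr   = {1, 1};  // movq x, r: port 5 only
constexpr UopCost kBroadcastReg = {1, 1};  // vbroadcastss x, x: port 5 only
constexpr UopCost kBroadcastMem = {1, 1};  // vbroadcastss x, m: port 2 or 3

// Sequence currently emitted: movl + movq + vbroadcastss (register source).
constexpr UopCost rematerialized() {
    return kMovlImm + kMovqXmmGpr + kBroadcastReg;
}

// Suggested alternative: one vbroadcastss from a constant-table entry.
constexpr UopCost from_constant_table() {
    return kBroadcastMem;
}

static_assert(rematerialized().retired == 3, "3 retired uops");
static_assert(rematerialized().executed == 2, "2 executed uops");
static_assert(from_constant_table().retired == 1, "1 retired uop");
static_assert(from_constant_table().executed == 1, "1 executed uop");

// The immediate in the quoted snippet is an IEEE-754 single-precision
// bit pattern: 1056964608 == 0x3F000000.
static_assert(1056964608u == 0x3F000000u, "same bit pattern");

// Reinterpret a 32-bit pattern as a float without violating aliasing rules.
inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Calling `bits_to_float(1056964608u)` yields `0.5f`, so the constant-table variant would broadcast a single 4-byte entry holding 0.5f.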
