Re: RFR: 8279508: Auto-vectorize Math.round API [v15]

Jatin Bhateja Mon, 21 Mar 2022 11:30:28 -0700

On Mon, 21 Mar 2022 17:56:22 GMT, Quan Anh Mai <d...@openjdk.java.net> wrote:


>> constant and register to register moves are never issued to execution ports, 
>> so rematerializing the value rather than reading it from memory will give 
>> better performance.
>
> I have come across this a little bit. While `movl r, i` may not consume 
> execution ports, `movq x, r` and `vbroadcastss x, x` surely do. This leads to 
> 3 retired and 2 executed uops. Furthermore, both `movq x, r` and 
> `vbroadcastss x, x` can only run on port 5, limiting the throughput of the 
> operation. On the contrary, a `vbroadcastss x, m` only results in 1 retired 
> and 1 executed uop, reducing pressure on the decoder and the backend. A 
> `vbroadcastss x, m` can run on both port 2 and port 3, offering a much better 
> throughput. Latency is not much of a concern in this circumstance since the 
> operation does not have any input dependency.
> 
>> register to register moves are never issued to execution ports
> 
> I believe you misremembered this part: a register to register move is only 
> elided when the registers are of the same kind, `vmovq x, r` would result in 
> 1 uop being executed on port 5.
> 
> What do you think? Thank you very much.

A read from the constant table will incur at minimum an L1I access penalty to 
access the code blob, and a larger penalty if the data is not present in the 
first-level cache. The change to replace vpbroadcastd with vbroadcastss was 
made for two reasons:
1) vbroadcastss works at the AVX=1 level, whereas vpbroadcastd needs the AVX2 
feature. 
2) We can avoid the extra cycle penalty due to two domain switchovers (FP -> INT 
and then INT -> FP).
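
For context, the broadcast discussed above sits in the vectorized code for loops 
over `Math.round`. A minimal sketch of such a loop follows; the class and method 
names are hypothetical and not taken from the PR, and whether C2 actually 
vectorizes it depends on the JVM build and the CPU features available.

```java
// Hypothetical example, not part of the PR: a simple counted loop over
// Math.round(float), the shape of code the auto-vectorizer is meant to handle.
public class RoundLoop {
    // Rounds every element of src to the nearest int (ties round toward
    // positive infinity, as specified for Math.round).
    static int[] roundAll(float[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.round(src[i]);
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] out = roundAll(new float[] {1.4f, 2.5f, -2.5f, -0.4f});
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

The scalar semantics are the same whether or not the loop is vectorized; only 
the generated code differs.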

-------------

PR: https://git.openjdk.java.net/jdk/pull/7094
