Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

Srinivas Vamsi Parasa Tue, 20 Jan 2026 16:07:53 -0800

On Mon, 19 Jan 2026 08:11:19 GMT, Emanuel Peter <[email protected]> wrote:


> Can you explain the difference between the two results?
>
Hi Emanuel (@eme64),
Yes, the conclusions you mentioned are correct. The store-only benchmark shows
that the masked store is slightly better than the unmasked store. However, the
store-followed-by-load benchmark shows that the unmasked store is better than
the masked vector store, because masked vector stores have very limited
store-forwarding support in the hardware.

The reason is that a load following a masked vector store to the same address
is blocked until the data is written into the cache. This is also mentioned in
the [Intel Software Optimization
Manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)
(Chapter 18, Section 18.4, page 578).
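
For illustration, the store-followed-by-load pattern looks roughly like the
Java sketch below. This is not the JMH benchmark from the PR; the class and
method names are hypothetical, and whether the JIT actually emits masked vector
stores for the fill depends on the JVM build, flags, and CPU.

```java
import java.util.Arrays;

// Hypothetical illustration of a "store followed by load" access pattern.
final class FillThenLoad {

    // Fills the array, then immediately reads back the elements just written.
    // If the fill's tail were done with a partially set masked vector store,
    // the loads below would hit the blocking case described in the manual.
    static long fillThenSum(int[] a, int value) {
        Arrays.fill(a, value);      // stores
        long sum = 0;
        for (int x : a) {           // loads from the addresses just stored to
            sum += x;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] small = new int[7];   // short array, so the tail dominates
        System.out.println(fillThenSum(small, 3));
    }
}
```

A store-only variant would simply omit the read-back loop, which is why it
cannot expose the forwarding penalty.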

Pasting the relevant text below for reference:


18.4 FORWARDING AND MEMORY MASKING
When using masked store and load, consider the following:
• When the mask is not all-ones or all-zeroes, the load operation, following
the masked store operation from the same address is blocked, until the data
is written to the cache.
• Unlike GPR forwarding rules, vector loads whether or not they are masked, do
not forward unless load and store addresses are exactly the same.
— st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
— st_mask = 00001111, ld_mask = 00000011, can forward: no, should block: yes
• When the mask is all-ones, blocking does not occur, because the data may be
forwarded to the load operation.
— st_mask = 11111111, ld_mask = don’t care, can forward: yes, should block: no
• When mask is all-zeroes, blocking does not occur, though neither does 
forwarding.
— st_mask = 00000000, ld_mask = don’t care, can forward: no, should block: no
In summary, a masked store should be used carefully, for example, if the
remainder size is known at compile time to be 1, and there is a load
operation from the same cache line after it (or there is an overlap in
addresses + vector lengths), it may be better to use scalar remainder
processing, rather than a masked remainder block.
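
The cases quoted above can be restated as a small lookup, shown here as a Java
sketch. The class, enum, and method names are hypothetical, and the 8-lane mask
width simply mirrors the manual's examples. The code only models the documented
rules for a load from exactly the same address as the store; it does not
measure or reproduce hardware behavior.

```java
// Hypothetical model of the rules quoted from section 18.4.
final class MaskedStoreForwarding {

    enum Outcome {
        MAY_FORWARD,  // can forward: yes, should block: no
        BLOCKED,      // can forward: no,  should block: yes
        NEITHER       // can forward: no,  should block: no
    }

    static final int LANES = 8;                     // assumption: 8-lane masks
    static final int ALL_ONES = (1 << LANES) - 1;   // 0b11111111

    // Classifies a following load from the same address by the store mask.
    // The load mask is not a parameter: in every quoted case the outcome is
    // decided by the store mask alone.
    static Outcome classify(int stMask) {
        int m = stMask & ALL_ONES;
        if (m == ALL_ONES) {
            return Outcome.MAY_FORWARD;
        }
        if (m == 0) {
            return Outcome.NEITHER;
        }
        return Outcome.BLOCKED;                     // partially set mask
    }

    public static void main(String[] args) {
        int[] masks = {0b10101010, 0b00001111, 0b11111111, 0b00000000};
        for (int mask : masks) {
            System.out.println(Integer.toBinaryString(mask) + " -> " + classify(mask));
        }
    }
}
```

A remainder (tail) store in a fill loop uses a partially set mask whenever the
tail is shorter than a full vector, which is the `BLOCKED` case here.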


Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3775508253
