Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

Sandhya Viswanathan Thu, 22 Jan 2026 10:04:46 -0800

On Wed, 21 Jan 2026 00:01:39 GMT, Srinivas Vamsi Parasa <[email protected]> 
wrote:


>> @vamsi-parasa Thanks for the extra data!
>> 
>> Do I see this right? In the plots 
>> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), 
>> the masked performance lies lower/better than unmasked performance (here we 
>> measure ns/op). But in your tables 
>> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841) 
>> you are measuring ops/ms, and are getting the opposite trend: masked is 
>> slower than unmasked.
>> 
>> Can you explain the difference between the two results?
>
> Hi Emanuel (@eme64),
> Yes, the conclusions you mentioned are correct. The store-only benchmark 
> shows that the masked store is slightly better than the unmasked store. 
> However, the store-followed-by-load benchmark shows that the unmasked store 
> is better than the masked vector store, because masked vector stores have 
> very limited store-forwarding support in the hardware.
> 
> This is because the load operation following the masked vector store is 
> blocked until the data is written into the cache. This is also mentioned in 
> the [Intel Software optimization 
> manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)
>  (Chapter 18, section 18.4, page 578).
> 
> Pasting the relevant text below for reference:
> 
> 
> 18.4 FORWARDING AND MEMORY MASKING
> When using masked store and load, consider the following:
> • When the mask is not all-ones or all-zeroes, the load operation, following 
> the masked store operation from the same address is blocked, until the data 
> is written to the cache. 
> • Unlike GPR forwarding rules, vector loads whether or not they are masked, 
> do not forward unless load and store addresses are exactly the same.
> — st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
> — st_mask = 00001111, ld_mask = 00000011, can forward: no, should block: yes
> • When the mask is all-ones, blocking does not occur, because the data may be 
> forwarded to the load operation.
> — st_mask = 11111111, ld_mask = don’t care, can forward: yes, should block: no
> • When mask is all-zeroes, blocking does not occur, though neither does 
> forwarding.
> — st_mask = 00000000, ld_mask = don’t care, can forward: no, should block: no
> In summary, a masked store should be used carefully, for example, if the 
> remainder size is known at compile time to be 1, and there is a load 
> operation from the same cache line after it (or there is an overlap in 
> addresses + vector lengths), it may be better to use scalar remainder 
> processing, rather than a masked remainder block.
> 
> 
> Thanks,
> Vamsi

> @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one 
> that shows a regression. How are we to proceed?
> 
> It seems that without loads [#28442 
> (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799),
>  this patch leads to a regression.
> 
> Only if there is a load from one of the last elements that the `Arrays.fill` 
> stored to with a masked operation do we get a slowdown, because of missing 
> store-to-load forwarding. If we instead started loading from the first 
> elements, those would probably already be in cache, and we would not have any 
> latency issues, right?
> 
> But is it not rather an edge-case that we load from the last elements 
> immediately after the `Arrays.fill`? At least for longer arrays, it seems an 
> edge case. For short arrays it is probably more likely that we access the 
> last element soon after the fill.
> 
> It does not seem like a trivial decision to me if this patch is an 
> improvement or not. What do you think @vamsi-parasa ?
> 
> @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts 
> here?

@eme64 My view is to go ahead with this PR, which replaces masked stores with 
scalar tail processing. As we have seen in 
https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause a big 
regression in certain scenarios: loading elements that were just written, or 
any other adjacent data that happens to fall within the masked store range.
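
For illustration only, here is a minimal Java sketch of the two things being 
discussed: the shape of a fill with a scalar tail, and the fill-then-read 
access pattern that is sensitive to store forwarding. It is not taken from 
the PR. The class name, method names, and the `LANES` constant are 
hypothetical, and the real change lives in the AVX512 fill stub, not in Java 
code. The block loop only stands in for the full-width vector stores.

```java
import java.util.Arrays;

public class FillTailSketch {
    // Assumption: 16 int lanes, standing in for one 512-bit vector store.
    static final int LANES = 16;

    // Models the shape of "vector main loop + scalar tail".
    // The tail is written element by element instead of with one masked store.
    static void fillScalarTail(int[] a, int value) {
        int mainEnd = a.length - (a.length % LANES);
        int i = 0;
        for (; i < mainEnd; i += LANES) {
            // Stands in for one full-width (all-ones mask) vector store.
            Arrays.fill(a, i, i + LANES, value);
        }
        for (; i < a.length; i++) {
            // Scalar remainder: plain stores that can forward to a later load.
            a[i] = value;
        }
    }

    // The access pattern under discussion: fill, then immediately load an
    // element that falls in the tail of the array. Requires a non-empty array.
    static int fillThenReadLast(int[] a, int value) {
        Arrays.fill(a, value);
        return a[a.length - 1];
    }
}
```

A benchmark of the second method over short array lengths that are not a 
multiple of the vector width is the kind of case where the tail strategy 
would be expected to matter.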

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3785772087
