On Wed, 21 Jan 2026 00:01:39 GMT, Srinivas Vamsi Parasa <[email protected]> wrote:
>> @vamsi-parasa Thanks for the extra data!
>>
>> Do I see this right? In the plots
>> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799),
>> the masked performance lies lower/better than unmasked performance (here we
>> measure ns/op). But in your tables
>> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841)
>> you are measuring ops/ms, and are getting the opposite trend: masked is
>> slower than unmasked.
>>
>> Can you explain the difference between the two results?
>
>> Can you explain the difference between the two results?
>>
> Hi Emanuel (@eme64),
> Yes, the conclusions you mentioned are correct. The store-only benchmark
> shows that the masked store is slightly better than the unmasked store.
> However, the store-followed-by-load benchmark shows that the unmasked store
> is better than the masked vector store, as masked vector stores have very
> limited store forwarding support in the hardware.
>
> This is because the load operation following the masked vector store is
> blocked until the data is written into the cache. This is also mentioned in
> the [Intel Software optimization
> manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)
> (Chapter 18, section 18.4, page 578).
>
> Pasting the relevant text below for reference:
>
> 18.4 FORWARDING AND MEMORY MASKING
> When using masked store and load, consider the following:
> • When the mask is not all-ones or all-zeroes, the load operation, following
>   the masked store operation from the same address is blocked, until the
>   data is written to the cache.
> • Unlike GPR forwarding rules, vector loads whether or not they are masked,
>   do not forward unless load and store addresses are exactly the same.
>   - st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
>   - st_mask = 00001111, ld_mask = 00000011, can forward: no, should block: yes
> • When the mask is all-ones, blocking does not occur, because the data may be
>   forwarded to the load operation.
>   - st_mask = 11111111, ld_mask = don’t care, can forward: yes, should block: no
> • When mask is all-zeroes, blocking does not occur, though neither does
>   forwarding.
>   - st_mask = 00000000, ld_mask = don’t care, can forward: no, should block: no
> In summary, a masked store should be used carefully, for example, if the
> remainder size is known at compile time to be 1, and there is a load
> operation from the same cache line after it (or there is an overlap in
> addresses + vector lengths), it may be better to use scalar remainder
> processing, rather than a masked remainder block.
>
> Thanks,
> Vamsi
>
> @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one
> that shows a regression. How are we to proceed?
>
> It seems that without loads [#28442
> (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799),
> this patch leads to a regression.
>
> Only if there is a load from one of the last elements that the `Arrays.fill`
> stored to with a masked operation do we get a slowdown, because of missing
> store-to-load forwarding. If we instead started loading from the first
> elements, those would probably already be in cache, and we would not have any
> latency issues, right?
>
> But is it not rather an edge case that we load from the last elements
> immediately after the `Arrays.fill`? At least for longer arrays, it seems an
> edge case. For short arrays it is probably more likely that we access the
> last element soon after the fill.
>
> It does not seem like a trivial decision to me if this patch is an
> improvement or not. What do you think @vamsi-parasa ?
>
> @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts
> here?
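
As a reading aid for the Intel excerpt quoted above, the three store-mask cases can be restated as a small Java sketch. The class, enum and method names are hypothetical (they are not part of the JDK or of this PR), and the code only encodes the table from section 18.4 for a load from exactly the same address; it does not measure or model real hardware behaviour.

```java
public class MaskedStoreForwarding {
    /** What happens to a load that follows a masked store to the same address. */
    public enum Outcome { FORWARD, BLOCK, NEITHER }

    /**
     * Restates the three cases from the quoted excerpt (illustrative only).
     * storeMask holds one bit per vector lane; laneCount must be in 1..32.
     * Assumes the load and store addresses are exactly the same.
     */
    public static Outcome classify(int storeMask, int laneCount) {
        int allOnes = (laneCount == 32) ? -1 : (1 << laneCount) - 1;
        int mask = storeMask & allOnes;
        if (mask == allOnes) {
            return Outcome.FORWARD;  // all-ones: data may be forwarded, no blocking
        }
        if (mask == 0) {
            return Outcome.NEITHER;  // all-zeroes: no forwarding, but no blocking either
        }
        return Outcome.BLOCK;        // partial mask: load waits until data is in the cache
    }

    public static void main(String[] args) {
        System.out.println(classify(0b10101010, 8)); // BLOCK
        System.out.println(classify(0b00001111, 8)); // BLOCK
        System.out.println(classify(0b11111111, 8)); // FORWARD
        System.out.println(classify(0b00000000, 8)); // NEITHER
    }
}
```

A masked tail store in a fill loop almost always uses a partial mask, so under these rules it lands in the BLOCK case whenever a load from the same address follows soon after.
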
@eme64 My thoughts are to go ahead with this PR, replacing masked stores with scalar tail processing. As we have seen from https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause big regressions in certain scenarios: accessing elements just written, or any other adjacent data that happens to fall in the masked store range.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3785772087
