On Wed, 21 Jan 2026 00:01:39 GMT, Srinivas Vamsi Parasa <[email protected]> wrote:

>> @vamsi-parasa Thanks for the extra data!
>>
>> Do I see this right? In the plots [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), the masked performance lies lower/better than unmasked performance (here we measure ns/op). But in your tables [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841) you are measuring ops/ms, and are getting the opposite trend: masked is slower than unmasked.
>>
>> Can you explain the difference between the two results?
>
>> Can you explain the difference between the two results?
>>
> Hi Emanuel (@eme64),
> Yes, the conclusions you mentioned are correct. The store-only benchmark shows that the masked store is slightly better than the unmasked store. However, the store-followed-by-load benchmark shows that the unmasked store is better than the masked vector store, as masked vector stores have very limited store-forwarding support in the hardware.
>
> This is because the load operation following the masked vector store is blocked until the data is written into the cache. This is also mentioned in the [Intel Software Optimization Manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html) (Chapter 18, section 18.4, page 578).
>
> Pasting the relevant text below for reference:
>
> 18.4 FORWARDING AND MEMORY MASKING
> When using masked store and load, consider the following:
> • When the mask is not all-ones or all-zeroes, the load operation following the masked store operation from the same address is blocked, until the data is written to the cache.
> • Unlike GPR forwarding rules, vector loads, whether or not they are masked, do not forward unless load and store addresses are exactly the same.
>   — st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
>   — st_mask = 00001111, ld_mask = 00000011, can forward: no, should block: yes
> • When the mask is all-ones, blocking does not occur, because the data may be forwarded to the load operation.
>   — st_mask = 11111111, ld_mask = don't care, can forward: yes, should block: no
> • When the mask is all-zeroes, blocking does not occur, though neither does forwarding.
>   — st_mask = 00000000, ld_mask = don't care, can forward: no, should block: no
> In summary, a masked store should be used carefully. For example, if the remainder size is known at compile time to be 1, and there is a load operation from the same cache line after it (or there is an overlap in addresses + vector lengths), it may be better to use scalar remainder processing rather than a masked remainder block.
>
> Thanks,
> Vamsi

@vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one that shows a regression. How are we to proceed?

It seems that without loads (https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), this patch gives a speedup. Only if there is a load from one of the last elements that the `Arrays.fill` stored to with a masked operation do we get a slowdown, because of missing store-to-load forwarding. If we instead started loading from the first elements, those would probably already be in cache, and we would not have any latency issues, right?

But is it not rather an edge case that we load from the last elements immediately after the `Arrays.fill`? At least for longer arrays, it seems an edge case. For short arrays it is probably more likely that we access the last element soon after the fill.
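
To make the access patterns concrete, here is a minimal JMH-style sketch (not one of the benchmarks linked above; the class name, array size, and fill value are placeholders I picked) of the three cases: fill only, fill followed by a load of the last element (which would target the masked tail store, if that is how this patch handles the remainder), and fill followed by a load of the first element (which hits a full-width store):

```java
// Minimal JMH sketch (NOT one of the benchmarks linked above); class name,
// array size, and fill value are placeholders chosen for illustration only.
// The assumption is that with this patch the tail of the fill (the last few
// elements) is written by a masked vector store.
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(1)
public class MaskedFillForwardingSketch {

    // 1027 is an arbitrary length that is not a multiple of the vector width,
    // so the fill ends with a remainder (potentially handled by a masked store).
    @Param({"1027"})
    int size;

    int[] a;

    @Setup
    public void setup() {
        a = new int[size];
    }

    @Benchmark
    public void fillOnly() {
        // Store-only pattern: no load follows the (possibly masked) tail store.
        Arrays.fill(a, 42);
    }

    @Benchmark
    public int fillThenLoadLast() {
        // Load from the last element right after the fill: this load targets
        // the tail, which under this patch may have been written by a masked
        // store, so store-to-load forwarding may be blocked.
        Arrays.fill(a, 42);
        return a[a.length - 1];
    }

    @Benchmark
    public int fillThenLoadFirst() {
        // Load from the first element: this was written by a full-width
        // (unmasked) store, which can forward to the load.
        Arrays.fill(a, 42);
        return a[0];
    }
}
```

If the forwarding rules quoted from the manual apply, one would expect `fillThenLoadLast` to show the extra latency of the blocked load, while `fillOnly` and `fillThenLoadFirst` would not.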
It does not seem like a trivial decision to me whether this patch is an improvement or not. What do you think, @vamsi-parasa? @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts here?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3776741440
