On Fri, 30 Jan 2026 08:33:57 GMT, Emanuel Peter wrote:

>>> > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and
>>> > > one that shows a regression. How are we to proceed?
>>> > > It seems that without loads
>>> > > [#28442 (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799),
>>> > > this patch leads to a regression.
>>> > > Only if there is a load from one of the last elements that the
>>> > > `Arrays.fill` stored to with a masked operation do we get a slowdown.
>>> > > Because of missing load-to-store forwarding. If we instead started
>>> > > loading from the first elements, those would probably already be in
>>> > > cache, and we would not have any latency issues, right?
>>> > > But is it not rather an edge-case that we load from the last elements
>>> > > immediately after the `Arrays.fill`? At least for longer arrays, it
>>> > > seems an edge case. For short arrays it is probably more likely that we
>>> > > access the last element soon after the fill.
>>> > > It does not seem like a trivial decision to me if this patch is an
>>> > > improvement or not. What do you think @vamsi-parasa ?
>>> > > @sviswa7 @dwhite-intel You already approved this PR. What are your
>>> > > thoughts here?
>>> >
>>> >
>>> > @eme64 My thoughts are to go ahead with this PR replacing masked stores
>>> > with scalar tail processing. As we have seen from
>>> > https://bugs.openjdk.org/browse/JDK-8349452 masked stores can cause big
>>> > regression in certain scenarios: accessing elements just written or any
>>> > other adjacent data that happens to fall in the masked store range.
>>>
>>> @sviswa7 But once this PR is integrated, I could file a performance
>>> regression with the benchmarks from
>>> [up here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799).
>>> So what's the argument which choice is better, since we have a mix of
>>> speedups/regression going either way, and both are probably in the 10-20%
>>> range?
>>
>> @eme64 You have a point there, but if you see the performance numbers for
>> ByteMatrix.java (from JDK-8349452) in the PR description above we are
>> talking about a recovery of 3x or so. The ByteMatrix.java is doing only
>> Arrays.fill() on individual arrays of a 2D array. The individual arrays
>> happened to be allocated alongside each other by the JVM and the next store
>> sees stalls due to the masked store of previous array initialization. That
>> was the reason to look for a solution without masked stores.
>
> @sviswa7 Ah right, the ByteMatrix.java is yet another case. There, we don't
> seem to have any loads.
>
>> The individual arrays happened to be allocated alongside each other by the
>> JVM and the next store sees stalls due to the masked store of previous array
>> initialization.
>
> Ah, that sounds interesting! Is there some tool that would let me see that it
> was due to masked store stalls?
> My (probably uneducated) guess would have been that it is just because a
> single element store would be much cheaper than a masked operation. If you
> only access a single or 2 elements, then a masked store is not yet
> profitable. What if the masked stores were a bit further away, say a
> cacheline away? Would that be significantly faster, because there are no
> stalls? Or still slow because of the inherent higher cost of masked
> operations?
>
> If we take the ByteMatrix.java benchmark: how would the performance change if
> we increase the size of the arrays (height)? Is there some height before
> which the non-masked solution is faster, and after which the masked is faster?
>
> Would it be a solution to use scalar stores for very small arrays, and only
> use the masked loop starting at a certain threshold?
>
> -----------------------
>
> I would like to see a summary of all the benchmarks we have here, and in
> which cases we get speedups/slowdowns, and for which reason.
> Maybe listing those reasons lets us see some third option we did not yet
> consider. And listing all the reasons and code shapes may help us find out
> which shapes we care about most, and then come to a decision that weighs off
> the pros and cons.
>
> We should also document our decision nicely in the code, so that if someone
> gets a regression in the future, we can see if we had already considered that
> code shape.
>
> Does that make sense? Or do you have a better idea how to make a good
> decision here?

Hi Emanuel (@eme64),

Based on the discussion, I will run further experiments to see if the regressions can be addressed and get back to you at a later date.

Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3825349231
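
For reference, the three code shapes mentioned in the quoted thread could be sketched roughly as below. This is a hypothetical illustration only, not code from the PR or from JDK-8349452: the class name `FillShapes` and its method names are made up here, and the sketch says nothing about which fill stub the JIT actually selects or how each shape performs.

```java
import java.util.Arrays;

// Hypothetical illustration of the code shapes discussed in the thread.
// Names are invented for this sketch; this is not a benchmark harness.
public class FillShapes {

    // Shape 1: fill, then immediately load one of the last elements written.
    // The thread suggests this load may hit the tail written by a masked store.
    static byte fillThenLoadLast(byte[] a, byte v) {
        Arrays.fill(a, v);
        return a[a.length - 1];
    }

    // Shape 2: fill, then load the first element instead of the last one.
    static byte fillThenLoadFirst(byte[] a, byte v) {
        Arrays.fill(a, v);
        return a[0];
    }

    // Shape 3: fill only, row by row over a 2D array (ByteMatrix-like).
    // Rows may happen to be allocated next to each other in the heap,
    // so a store to one row can closely follow the tail store of the previous row.
    static void fillRows(byte[][] m, byte v) {
        for (byte[] row : m) {
            Arrays.fill(row, v);
        }
    }
}
```

To compare the shapes, each method would need to be wrapped in a proper benchmark harness (for example JMH) and run across a range of array lengths, since the thread expects short and long arrays to behave differently.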
