On Thu, 22 Jan 2026 20:30:26 GMT, Srinivas Vamsi Parasa <[email protected]> wrote:
>> Srinivas Vamsi Parasa has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>>   Update ALL of ArraysFill JMH micro
>
> ### Int VectorBulkOperationsArray Fill
>
> Benchmark (ns/op) | Size | -OptimizeFill (JITed code) | +OptimizeFill (Masked store) | +OptimizeFill (Unmasked store - This PR) | Delta (masked vs. unmasked)
> -- | -- | -- | -- | -- | --
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 0 | 0.649 | 0.651 | 0.655 | 1%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 1 | 2.371 | 2.801 | 2.827 | 1%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 2 | 2.374 | 2.585 | 2.942 | 12%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 3 | 2.809 | 2.589 | 3.094 | 16%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 4 | 3.356 | 2.587 | 2.852 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 5 | 3.531 | 2.588 | 3.158 | 18%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 6 | 3.747 | 2.589 | 3.118 | 17%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 7 | 3.989 | 2.589 | 3.332 | 22%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 8 | 5.047 | 2.588 | 2.832 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 9 | 4.79 | 2.845 | 3.056 | 7%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 10 | 4.982 | 2.85 | 3.274 | 13%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 11 | 4.551 | 2.852 | 3.521 | 19%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 12 | 4.281 | 2.853 | 3.12 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 13 | 4.391 | 2.894 | 3.499 | 17%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 14 | 4.909 | 2.848 | 3.339 | 15%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 15 | 5.269 | 2.853 | 3.524 | 19%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 16 | 5.663 | 2.836 | 3.101 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 17 | 5.553 | 2.924 | 3.111 | 6%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 18 | 5.105 | 2.933 | 3.358 | 13%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 19 | 5.09 | 2.942 | 3.583 | 18%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 20 | 4.457 | 2.927 | 3.272 | 11%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 21 | 4.745 | 3.104 | 3.598 | 14%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 22 | 4.949 | 2.932 | 3.481 | 16%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 23 | 4.992 | 2.939 | 3.761 | 22%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 24 | 5.198 | 2.92 | 3.205 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 25 | 5.097 | 3.116 | 3.387 | 8%
> VectorBulkOperationsArray.fill_var_...
>
> > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one that shows a regression. How are we to proceed?
> >
> > It seems that without loads [#28442 (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), this patch leads to a regression. We only get a slowdown if there is a load from one of the last elements that the `Arrays.fill` stored to with a masked operation, because of missing store-to-load forwarding. If we instead started loading from the first elements, those would probably already be in cache, and we would not have any latency issues, right?
> >
> > But is it not rather an edge case that we load from the last elements immediately after the `Arrays.fill`? At least for longer arrays it seems an edge case; for short arrays it is probably more likely that we access the last element soon after the fill.
> >
> > It does not seem like a trivial decision to me whether this patch is an improvement or not. What do you think @vamsi-parasa?
> >
> > @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts here?
>
> @eme64 My thoughts are to go ahead with this PR, replacing masked stores with scalar tail processing.
> As we have seen from https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause big regressions in certain scenarios: accessing elements just written, or any other adjacent data that happens to fall in the masked store range.

@sviswa7 But once this PR is integrated, I could file a performance regression with the benchmarks from [up here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So what is the argument for which choice is better, given that we have a mix of speedups and regressions going either way, and both are probably in the 10-20% range?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3788945532
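[Editor's illustration, not code from the PR or its benchmarks.] The access pattern under debate can be sketched as follows: `Arrays.fill`, when intrinsified with a masked vector store for the tail, may leave the last few elements in a store whose bytes a subsequent scalar load cannot receive via store-to-load forwarding, causing a stall; the PR's scalar tail processing avoids that at the cost of slower fills for some sizes. The class name, array length, and fill value below are all hypothetical, chosen only for illustration.

```java
import java.util.Arrays;

public class FillThenLoadLast {
    // Fill the array, then immediately load the last element -- the element
    // most likely to have been written by a masked tail store, and therefore
    // the load most likely to miss store-to-load forwarding.
    static int fillThenLoadLast(int[] a, int v) {
        Arrays.fill(a, v);
        return a[a.length - 1];
    }

    public static void main(String[] args) {
        // 23 is deliberately not a multiple of a typical vector lane count,
        // so the fill has a tail (hypothetical size, for illustration only).
        int[] a = new int[23];
        System.out.println(fillThenLoadLast(a, 7)); // prints 7
    }
}
```

Whether this pattern is common enough to outweigh the fill-only speedups of masked stores is exactly the judgment call the thread is discussing.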
