On Thu, 22 Jan 2026 20:30:26 GMT, Srinivas Vamsi Parasa <[email protected]> 
wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Update ALL of ArraysFill JMH micro
>
> ### Int VectorBulkOperationsArray Fill
> 
> Benchmark (ns/op) | Size | -OptimizeFill (JITed code) | +OptimizeFill (Masked store) | +OptimizeFill (Unmasked store - This PR) | Delta (masked vs. unmasked)
> -- | -- | -- | -- | -- | --
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 0 | 0.649 | 0.651 | 0.655 | 1%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 1 | 2.371 | 2.801 | 2.827 | 1%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 2 | 2.374 | 2.585 | 2.942 | 12%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 3 | 2.809 | 2.589 | 3.094 | 16%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 4 | 3.356 | 2.587 | 2.852 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 5 | 3.531 | 2.588 | 3.158 | 18%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 6 | 3.747 | 2.589 | 3.118 | 17%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 7 | 3.989 | 2.589 | 3.332 | 22%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 8 | 5.047 | 2.588 | 2.832 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 9 | 4.79 | 2.845 | 3.056 | 7%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 10 | 4.982 | 2.85 | 3.274 | 13%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 11 | 4.551 | 2.852 | 3.521 | 19%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 12 | 4.281 | 2.853 | 3.12 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 13 | 4.391 | 2.894 | 3.499 | 17%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 14 | 4.909 | 2.848 | 3.339 | 15%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 15 | 5.269 | 2.853 | 3.524 | 19%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 16 | 5.663 | 2.836 | 3.101 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 17 | 5.553 | 2.924 | 3.111 | 6%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 18 | 5.105 | 2.933 | 3.358 | 13%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 19 | 5.09 | 2.942 | 3.583 | 18%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 20 | 4.457 | 2.927 | 3.272 | 11%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 21 | 4.745 | 3.104 | 3.598 | 14%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 22 | 4.949 | 2.932 | 3.481 | 16%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 23 | 4.992 | 2.939 | 3.761 | 22%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 24 | 5.198 | 2.92 | 3.205 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 25 | 5.097 | 3.116 | 3.387 | 8%
> VectorBulkOperationsArray.fill_var_...

> > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one 
> > that shows a regression. How are we to proceed?
> > It seems that without loads [#28442 
> > (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799),
> >  this patch leads to a regression.
> > We only get a slowdown if there is a load from one of the last elements 
> > that the `Arrays.fill` stored to with a masked operation, because of 
> > missing load-to-store forwarding. If we instead started loading from the 
> > first elements, those would probably already be in cache, and we would 
> > not have any latency issues, right?
> > But is it not rather an edge-case that we load from the last elements 
> > immediately after the `Arrays.fill`? At least for longer arrays, it seems 
> > an edge case. For short arrays it is probably more likely that we access 
> > the last element soon after the fill.
> > It does not seem like a trivial decision to me if this patch is an 
> > improvement or not. What do you think @vamsi-parasa ?
> > @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts 
> > here?
> 
> @eme64 My thoughts are to go ahead with this PR, replacing masked stores 
> with scalar tail processing. As we have seen from 
> https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause big 
> regressions in certain scenarios: accessing elements just written, or any 
> other adjacent data that happens to fall in the masked-store range.

@sviswa7 But once this PR is integrated, I could file a performance regression 
with the benchmarks from [up 
here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So 
what's the argument for which choice is better, given that we have a mix of 
speedups/regressions going either way, and both are probably in the 10-20% range?
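For readers following along, here is a hedged sketch in plain Java of the two things being weighed (the real code is C2-generated machine code, and all names in this sketch are hypothetical, not taken from the PR): scalar tail processing after a vectorized fill body, and the fill-then-load-last-element access pattern where a masked tail store can miss store-to-load forwarding.

```java
import java.util.Arrays;

// Hypothetical illustration only; the actual OptimizeFill code is emitted
// by the JIT compiler, not written in Java.
public class FillTailSketch {
    // Scalar tail processing: the body covers full 8-lane chunks (standing
    // in for one vector store each), and the remaining 0..7 elements are
    // stored one at a time instead of with a single masked vector store.
    static void fillScalarTail(int[] a, int v) {
        int i = 0;
        int upper = a.length & ~7;            // largest multiple of 8
        for (; i < upper; i += 8) {
            for (int j = 0; j < 8; j++) {     // stands in for one vector store
                a[i + j] = v;
            }
        }
        for (; i < a.length; i++) {           // scalar tail
            a[i] = v;
        }
    }

    // The access pattern behind the regression: a load from one of the last
    // elements immediately after the fill. If the tail was written by a
    // masked store, this load may not get store-to-load forwarding.
    static int fillThenReadLast(int[] a, int v) {
        Arrays.fill(a, v);
        return a[a.length - 1];
    }

    public static void main(String[] args) {
        int[] a = new int[13];                // non-multiple of 8 forces a tail
        fillScalarTail(a, 7);
        System.out.println(a[12]);            // prints 7
        System.out.println(fillThenReadLast(a, 9));  // prints 9
    }
}
```

Note the sketch only shows the shape of the code; whether the forwarding stall actually occurs depends on the microarchitecture and on how the masked store overlaps the subsequent load.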

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3788945532
