On Fri, 16 Jan 2026 20:14:31 GMT, Srinivas Vamsi Parasa <[email protected]> 
wrote:

>> The goal of this PR is to fix the performance regression in Arrays.fill() 
>> x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX 
>> stores with store instructions without masks (i.e. unmasked stores). 
>> `fill32_masked()` and `fill64_masked()` stubs are replaced with 
>> `fill32_unmasked()` and `fill64_unmasked()` respectively.
>> 
>> To speed up unmasked stores, array fills for sizes < 64 bytes are broken down 
>> into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>> 
>> 
>> ### **Performance comparison for byte array fills repeated 1 million times in a loop**
>> 
>> 
>> UseAVX=3 ByteArray Size | +OptimizeFill (Masked store stub) [secs] | -OptimizeFill (No stub) [secs] | This PR: +OptimizeFill (Unmasked store stub) [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.189
>> 2 | 0.46 | 0.16 | 0.191
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.212
>> 5 | 0.46 | 0.29 | 0.364
>> 10 | 0.46 | 0.58 | 0.354
>> 15 | 0.46 | 0.42 | 0.325
>> 16 | 0.46 | 0.46 | 0.281
>> 17 | 0.21 | 0.5 | 0.365
>> 20 | 0.21 | 0.37 | 0.326
>> 25 | 0.21 | 0.59 | 0.343
>> 31 | 0.21 | 0.53 | 0.317
>> 32 | 0.21 | 0.58 | 0.249
>> 35 | 0.5 | 0.77 | 0.303
>> 40 | 0.5 | 0.61 | 0.312
>> 45 | 0.5 | 0.52 | 0.364
>> 48 | 0.5 | 0.66 | 0.283
>> 49 | 0.22 | 0.69 | 0.367
>> 50 | 0.22 | 0.78 | 0.344
>> 55 | 0.22 | 0.67 | 0.332
>> 60 | 0.22 | 0.67 | 0.312
>> 64 | 0.22 | 0.82 | 0.253
>> 70 | 0.51 | 1.1 | 0.394
>> 80 | 0.49 | 0.89 | 0.346
>> 90 | 0.225 | 0.68 | 0.385
>> 100 | 0.54 | 1.09 | 0.364
>> 110 | 0.6 | 0.98 | 0.416
>> 120 | 0.26 | 0.75 | 0.367
>> 128 | 0.266 | 1.1 | 0.342
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Update ALL of ArraysFill JMH micro

Also, the benefit of unmasked stores (this PR) over masked vector stores (the 
existing implementation) shows up when the ArraysFill.java JMH micro-benchmark 
is updated to fill (write) the array and then read back the filled data, as 
shown below for the short array fill:


    @Benchmark
    public short testShortFill() {
        Arrays.fill(testShortArray, (short) -1);
        return (short) (testShortArray[0] + testShortArray[size - 1]);
    }
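
For readers without the surrounding benchmark class, the fill-then-read pattern above can be approximated in plain Java. This is only a sketch: it does not use JMH, so it has no warm-up or measurement control and says nothing about timing. The class name `FillThenReadSketch`, the helper `fillThenRead`, and the array length of 20 are hypothetical choices for illustration.

```java
import java.util.Arrays;

final class FillThenReadSketch {
    // Hypothetical helper mirroring the benchmark body: fill the array,
    // then read the first and last elements so the result depends on the stores.
    static short fillThenRead(short[] array) {
        Arrays.fill(array, (short) -1);
        return (short) (array[0] + array[array.length - 1]);
    }

    public static void main(String[] args) {
        short[] testShortArray = new short[20]; // assumed size, for illustration only
        long checksum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            // Accumulate the result so the fill and the reads are not dead code.
            checksum += fillThenRead(testShortArray);
        }
        System.out.println(checksum);
    }
}
```

Each call returns `-2` (the sum of two `-1` elements), so the returned value is a function of the data just written by `Arrays.fill`.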





**(Higher is better)**

Benchmark (ops/ms), MaxVectorSize = 32 | SIZE | +OptimizeFill (Masked Store) | +OptimizeFill (Unmasked Store - This PR) | Delta
-- | -- | -- | -- | --
ArraysFill.testByteFill | 1 | 175381 | 342456 | 95%
ArraysFill.testByteFill | 10 | 175421 | 264607 | 51%
ArraysFill.testByteFill | 20 | 175447 | 271111 | 55%
ArraysFill.testByteFill | 30 | 175454 | 253351 | 44%
ArraysFill.testByteFill | 40 | 162429 | 273043 | 68%
ArraysFill.testByteFill | 50 | 162443 | 251734 | 55%
ArraysFill.testByteFill | 60 | 162454 | 248156 | 53%
ArraysFill.testByteFill | 70 | 156659 | 236497 | 51%
ArraysFill.testByteFill | 80 | 175403 | 269433 | 54%
ArraysFill.testByteFill | 90 | 175422 | 230276 | 31%
ArraysFill.testByteFill | 100 | 168662 | 252394 | 50%
ArraysFill.testByteFill | 110 | 146182 | 217917 | 49%
ArraysFill.testByteFill | 120 | 168693 | 239126 | 42%
ArraysFill.testByteFill | 130 | 162378 | 166159 | 2%
ArraysFill.testByteFill | 140 | 156569 | 168296 | 7%
ArraysFill.testByteFill | 150 | 151214 | 167388 | 11%
ArraysFill.testByteFill | 160 | 156594 | 173529 | 11%
ArraysFill.testByteFill | 170 | 156590 | 167976 | 7%
ArraysFill.testByteFill | 180 | 156561 | 173015 | 11%
ArraysFill.testByteFill | 190 | 156601 | 173073 | 11%
ArraysFill.testByteFill | 200 | 168277 | 174293 | 4%
ArraysFill.testIntFill | 1 | 175403 | 334460 | 91%
ArraysFill.testIntFill | 10 | 162437 | 273799 | 69%
ArraysFill.testIntFill | 20 | 156636 | 273483 | 75%
ArraysFill.testIntFill | 30 | 162440 | 243303 | 50%
ArraysFill.testIntFill | 40 | 156592 | 175162 | 12%
ArraysFill.testIntFill | 50 | 156585 | 168433 | 8%
ArraysFill.testIntFill | 60 | 151193 | 195235 | 29%
ArraysFill.testIntFill | 70 | 141406 | 167060 | 18%
ArraysFill.testIntFill | 80 | 141406 | 167119 | 18%
ArraysFill.testIntFill | 90 | 141437 | 166976 | 18%
ArraysFill.testIntFill | 100 | 168349 | 173943 | 3%
ArraysFill.testIntFill | 110 | 132864 | 173096 | 30%
ArraysFill.testIntFill | 120 | 128972 | 173722 | 35%
ArraysFill.testIntFill | 130 | 128958 | 149835 | 16%
ArraysFill.testIntFill | 140 | 167934 | 165903 | -1%
ArraysFill.testIntFill | 150 | 121799 | 133351 | 9%
ArraysFill.testIntFill | 160 | 121824 | 154654 | 27%
ArraysFill.testIntFill | 170 | 121800 | 163515 | 34%
ArraysFill.testIntFill | 180 | 121770 | 150235 | 23%
ArraysFill.testIntFill | 190 | 121808 | 145138 | 19%
ArraysFill.testIntFill | 200 | 112433 | 142084 | 26%
ArraysFill.testShortFill | 1 | 99696 | 309697 | 211%
ArraysFill.testShortFill | 10 | 175433 | 290773 | 66%
ArraysFill.testShortFill | 20 | 175417 | 270345 | 54%
ArraysFill.testShortFill | 30 | 162459 | 257180 | 58%
ArraysFill.testShortFill | 40 | 175438 | 273348 | 56%
ArraysFill.testShortFill | 50 | 162445 | 272307 | 68%
ArraysFill.testShortFill | 60 | 168669 | 241798 | 43%
ArraysFill.testShortFill | 70 | 156509 | 174347 | 11%
ArraysFill.testShortFill | 80 | 151207 | 168424 | 11%
ArraysFill.testShortFill | 90 | 162332 | 197780 | 22%
ArraysFill.testShortFill | 100 | 156583 | 174738 | 12%
ArraysFill.testShortFill | 110 | 151147 | 175170 | 16%
ArraysFill.testShortFill | 120 | 167078 | 191352 | 15%
ArraysFill.testShortFill | 130 | 146102 | 171682 | 18%
ArraysFill.testShortFill | 140 | 151206 | 203786 | 35%
ArraysFill.testShortFill | 150 | 146133 | 167027 | 14%
ArraysFill.testShortFill | 160 | 141426 | 167047 | 18%
ArraysFill.testShortFill | 170 | 136974 | 167049 | 22%
ArraysFill.testShortFill | 180 | 141420 | 173568 | 23%
ArraysFill.testShortFill | 190 | 136164 | 172806 | 27%
ArraysFill.testShortFill | 200 | 141464 | 167048 | 18%
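
The quoted PR description says that fills smaller than 64 bytes are broken into 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size. A minimal sketch of one way such a split could work is below, assuming a greedy largest-first decomposition. The actual stub is x86 assembly and may order or overlap its stores differently; `FillSizeSketch` and `splitIntoStores` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class FillSizeSketch {
    // Store widths in bytes, largest first, as listed in the PR description.
    private static final int[] WIDTHS = {32, 16, 8, 4, 2, 1};

    // Hypothetical helper: greedily split a fill length below 64 bytes
    // into store widths. Each width is used at most once.
    static List<Integer> splitIntoStores(int lengthInBytes) {
        if (lengthInBytes < 0 || lengthInBytes >= 64) {
            throw new IllegalArgumentException("length must be in [0, 64)");
        }
        List<Integer> stores = new ArrayList<>();
        int remaining = lengthInBytes;
        for (int width : WIDTHS) {
            if (remaining >= width) {
                stores.add(width);
                remaining -= width;
            }
        }
        return stores;
    }

    public static void main(String[] args) {
        System.out.println(splitIntoStores(45)); // prints [32, 8, 4, 1]
    }
}
```

Under this assumption a 45-byte fill becomes four stores (32 + 8 + 4 + 1) and the worst case, 63 bytes, becomes six, with no mask register involved.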

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3761712841
