Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests [v4]

2021-01-19 Thread Valerie Peng
On Mon, 18 Jan 2021 13:39:04 GMT, Claes Redestad  wrote:

>> - The MD5 intrinsics added by 
>> [JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) show that 
>> the `int[] x` isn't actually needed. This also applies to the SHA intrinsics 
>> from which the MD5 intrinsic takes inspiration.
>> - Using VarHandles we can simplify the code in `ByteArrayAccess` enough to 
>> make it acceptable to use inline and replace the array in MD5 wholesale. 
>> This improves performance both in the presence and the absence of the 
>> intrinsic optimization.
>> - Doing the exact same thing in the SHA impls would be unwieldy (64+ element 
>> arrays), but allocating the array lazily gets most of the speed-up in the 
>> presence of an intrinsic while being neutral in its absence.
>> 
>> Baseline:
>> Benchmark                    (digesterName)  (length)  Cnt     Score     Error  Units
>> MessageDigests.digest                   MD5        16   15  2714.307 ±  21.133  ops/ms
>> MessageDigests.digest                   MD5      1024   15   318.087 ±   0.637  ops/ms
>> MessageDigests.digest                 SHA-1        16   15  1387.266 ±  40.932  ops/ms
>> MessageDigests.digest                 SHA-1      1024   15   109.273 ±   0.149  ops/ms
>> MessageDigests.digest               SHA-256        16   15   995.566 ±  21.186  ops/ms
>> MessageDigests.digest               SHA-256      1024   15    89.104 ±   0.079  ops/ms
>> MessageDigests.digest               SHA-512        16   15   803.030 ±  15.722  ops/ms
>> MessageDigests.digest               SHA-512      1024   15   115.611 ±   0.234  ops/ms
>> MessageDigests.getAndDigest             MD5        16   15  2190.367 ±  97.037  ops/ms
>> MessageDigests.getAndDigest             MD5      1024   15   302.903 ±   1.809  ops/ms
>> MessageDigests.getAndDigest           SHA-1        16   15  1262.656 ±  43.751  ops/ms
>> MessageDigests.getAndDigest           SHA-1      1024   15   104.889 ±   3.554  ops/ms
>> MessageDigests.getAndDigest         SHA-256        16   15   914.541 ±  55.621  ops/ms
>> MessageDigests.getAndDigest         SHA-256      1024   15    85.708 ±   1.394  ops/ms
>> MessageDigests.getAndDigest         SHA-512        16   15   737.719 ±  53.671  ops/ms
>> MessageDigests.getAndDigest         SHA-512      1024   15   112.307 ±   1.950  ops/ms
>> 
>> GC:
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm             MD5        16   15   312.011 ±   0.005  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm           SHA-1        16   15   584.020 ±   0.006  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm         SHA-256        16   15   544.019 ±   0.016  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm         SHA-512        16   15  1056.037 ±   0.003  B/op
>> 
>> Target:
>> Benchmark                    (digesterName)  (length)  Cnt     Score     Error  Units
>> MessageDigests.digest                   MD5        16   15  3134.462 ±  43.685  ops/ms
>> MessageDigests.digest                   MD5      1024   15   323.667 ±   0.633  ops/ms
>> MessageDigests.digest                 SHA-1        16   15  1418.742 ±  38.223  ops/ms
>> MessageDigests.digest                 SHA-1      1024   15   110.178 ±   0.788  ops/ms
>> MessageDigests.digest               SHA-256        16   15  1037.949 ±  21.214  ops/ms
>> MessageDigests.digest               SHA-256      1024   15    89.671 ±   0.228  ops/ms
>> MessageDigests.digest               SHA-512        16   15   812.028 ±  39.489  ops/ms
>> MessageDigests.digest               SHA-512      1024   15   116.738 ±   0.249  ops/ms
>> MessageDigests.getAndDigest             MD5        16   15  2314.379 ± 229.294  ops/ms
>> MessageDigests.getAndDigest             MD5      1024   15   307.835 ±   5.730  ops/ms
>> MessageDigests.getAndDigest           SHA-1        16   15  1326.887 ±  63.263  ops/ms
>> MessageDigests.getAndDigest           SHA-1      1024   15   106.611 ±   2.292  ops/ms
>> MessageDigests.getAndDigest         SHA-256        16   15   961.589 ±  82.052  ops/ms
>> MessageDigests.getAndDigest         SHA-256      1024   15    88.646 ±   0.194  ops/ms
>> MessageDigests.getAndDigest         SHA-512        16   15   775.417 ±  56.775  ops/ms
>> MessageDigests.getAndDigest         SHA-512      1024   15   112.904 ±   2.014  ops/ms
>> 
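
To illustrate the `VarHandle` point in the quoted description: a byte-array view handle can read each 32-bit word of an input block straight from the `byte[]`, so no intermediate `int[]` has to be filled first. This is only a minimal sketch, not the actual `ByteArrayAccess` or `MD5` code; the class and method names below are hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class LittleEndianRead {
    // Views a byte[] as little-endian ints; MD5 consumes each 64-byte block as 16 such words.
    private static final VarHandle LE_INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    /** Reads the 32-bit little-endian word that starts at byte offset ofs (bounds-checked). */
    static int readIntLE(byte[] buf, int ofs) {
        return (int) LE_INT.get(buf, ofs);
    }

    /** Equivalent manual assembly from four bytes, kept only for comparison. */
    static int readIntLEManual(byte[] buf, int ofs) {
        return (buf[ofs] & 0xff)
             | (buf[ofs + 1] & 0xff) << 8
             | (buf[ofs + 2] & 0xff) << 16
             | (buf[ofs + 3] & 0xff) << 24;
    }

    public static void main(String[] args) {
        byte[] word = {0x78, 0x56, 0x34, 0x12};
        System.out.println(Integer.toHexString(readIntLE(word, 0))); // prints 12345678
    }
}
```

A digest round could then take `readIntLE(buf, ofs + 4 * i)` as a local value instead of indexing a pre-filled scratch array, which is the allocation the description says can be dropped.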

Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests [v4]

2021-01-18 Thread Claes Redestad
> GC (Target):
> MessageDigests.getAndDigest:·gc.alloc.rate.norm             MD5        16   15   232.009 ±   0.006  B/op
> 
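
The lazy-allocation idea from the PR description (allocate the SHA work array only when the plain Java path actually runs) can be sketched roughly as follows. This is a toy under stated assumptions: `LazyScratch` is a hypothetical class, the mixing step is not a real SHA round, and the only point is where the allocation happens.

```java
final class LazyScratch {
    private static final int WORDS = 80; // SHA-1 expands each block into 80 words
    private int[] w;                     // stays null until the Java fallback first runs
    private int state;

    /**
     * Java fallback path. If a JIT intrinsic replaced this method, the
     * allocation below would never be reached and w would stay null.
     */
    void compress(byte[] buf, int ofs) {
        if (w == null) {
            w = new int[WORDS];
        }
        // Toy mixing, NOT a real SHA round: load 16 big-endian words and fold them in.
        for (int i = 0; i < 16; i++) {
            int p = ofs + i * 4;
            w[i] = (buf[p] & 0xff) << 24 | (buf[p + 1] & 0xff) << 16
                 | (buf[p + 2] & 0xff) << 8 | (buf[p + 3] & 0xff);
            state = Integer.rotateLeft(state, 5) ^ w[i];
        }
    }

    boolean scratchAllocated() {
        return w != null;
    }

    int state() {
        return state;
    }
}
```

Constructing the object costs no array allocation; the array appears only after the first fallback `compress` call, which matches the description's claim of being neutral without an intrinsic and cheaper with one.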

Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests [v3]

2021-01-18 Thread Claes Redestad

Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests [v2]

2021-01-15 Thread Valerie Peng
On Fri, 15 Jan 2021 23:36:35 GMT, Claes Redestad  wrote:


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests [v2]

2021-01-15 Thread Claes Redestad

Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests [v2]

2021-01-15 Thread Claes Redestad
On Fri, 15 Jan 2021 23:21:00 GMT, Valerie Peng  wrote:

>> Claes Redestad has updated the pull request with a new target base due to a 
>> merge or a rebase. The incremental webrev excludes the unrelated changes 
>> brought in by the merge/rebase. The pull request contains 20 additional 
>> commits since the last revision:
>> 
>>  - Copyrights
>>  - Merge branch 'master' into improve_md5
>>  - Remove unused Unsafe import
>>  - Harmonize MD4 impl, remove now-redundant checks from ByteArrayAccess (VHs 
>> do bounds checks, most of which will be optimized away)
>>  - Merge branch 'master' into improve_md5
>>  - Apply allocation avoiding optimizations to all SHA versions sharing 
>> structural similarities with MD5
>>  - Remove unused reverseBytes imports
>>  - Copyrights
>>  - Fix copy-paste error
>>  - Various fixes (IDE stopped IDEing..)
>>  - ... and 10 more: 
>> https://git.openjdk.java.net/jdk/compare/6e03c8d3...cafa3e49
>
> test/micro/org/openjdk/bench/java/util/UUIDBench.java line 2:
> 
>> 1: /*
>> 2:  * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights 
>> reserved.
> 
> nit: other files should also have this 2021 update. It seems most of them are 
> not updated and still use 2020.

fixed

-

PR: https://git.openjdk.java.net/jdk/pull/1855


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-15 Thread Valerie Peng
On Sun, 20 Dec 2020 20:27:03 GMT, Claes Redestad  wrote:


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-15 Thread Claes Redestad
On Fri, 15 Jan 2021 22:54:32 GMT, Valerie Peng  wrote:


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-15 Thread Valerie Peng
On Sun, 20 Dec 2020 20:27:03 GMT, Claes Redestad  wrote:


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-12 Thread Valerie Peng
On Thu, 7 Jan 2021 18:50:05 GMT, Claes Redestad  wrote:

>> Removing the UUID clone cache and running the microbenchmark along with the 
>> changes in #1933:
>> 
>> Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
>> UUIDBench.fromType3Bytes                           2  thrpt   12    2.182 ±  0.090  ops/us
>> UUIDBench.fromType3Bytes:·gc.alloc.rate            2  thrpt   12  439.020 ± 18.241  MB/sec
>> UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  264.022 ±  0.003  B/op
>> 
>> The goal now is to simplify the digest code and compare alternatives.
>
> I've run various tests and concluded that the `VarHandle`ized code matches 
> or improves upon the `Unsafe`-riddled code in `ByteArrayAccess`. I then went 
> ahead and consolidated on a similar code pattern throughout 
> `ByteArrayAccess` for consistency, which amounts to a good cleanup.
> 
> With MD5 intrinsics disabled, I get this baseline:
> 
> Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
> UUIDBench.fromType3Bytes                           2  thrpt   12    1.245 ±  0.077  ops/us
> UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  488.042 ±  0.004  B/op
> 
> With the current patch here (not including #1933):
> 
> Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
> UUIDBench.fromType3Bytes                           2  thrpt   12    1.431 ±  0.106  ops/us
> UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  408.035 ±  0.006  B/op
> 
> If I isolate the `ByteArrayAccess` changes, I get performance that is neutral 
> or slightly better than baseline for these tests:
> 
> Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
> UUIDBench.fromType3Bytes                           2  thrpt   12    1.317 ±  0.092  ops/us
> UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  488.042 ±  0.004  B/op

Thanks for the performance enhancement, I will take a look.

-

PR: https://git.openjdk.java.net/jdk/pull/1855


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-08 Thread Claes Redestad
On Wed, 6 Jan 2021 01:27:52 GMT, Claes Redestad  wrote:

>> Hitting up `new MD5()` directly could be a great idea. I expect this would 
>> be just as fast as the cache+clone (if not faster), but I'm a bit worried 
>> we'd be short-circuiting the ability to install an alternative MD5 provider 
>> (which may or may not be a thing we must support..), but it's worth 
>> exploring.
>> 
>> Comparing performance of this against a `ByteBuffer` impl is on my TODO. The 
>> `VarHandle` gets heavily inlined and optimized here, though, with 
>> performance in my tests similar to the `Unsafe` use in `ByteArrayAccess`.
>
> I've identified a number of optimizations to the plumbing behind
> `MessageDigest.getInstance(..)` over in #1933 that remove 80-90% of the
> throughput overhead and all the allocation overhead compared to the `clone()`
> approach prototyped here. The remaining 20ns/op overhead might not be enough
> of a concern to warrant a point fix in `UUID::nameUUIDFromBytes`.

Removing the UUID clone cache and running the microbenchmark along with the
changes in #1933:

Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
UUIDBench.fromType3Bytes                           2  thrpt   12    2.182 ±  0.090  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate            2  thrpt   12  439.020 ± 18.241  MB/sec
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  264.022 ±  0.003  B/op

The goal now is to simplify the digest code and compare alternatives.

-

PR: https://git.openjdk.java.net/jdk/pull/1855


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-08 Thread Claes Redestad
On Thu, 7 Jan 2021 14:45:03 GMT, Claes Redestad  wrote:

>> I've identified a number of optimizations to the plumbing behind
>> `MessageDigest.getInstance(..)` over in #1933 that remove 80-90% of the
>> throughput overhead and all the allocation overhead compared to the
>> `clone()` approach prototyped here. The remaining 20ns/op overhead might not
>> be enough of a concern to warrant a point fix in `UUID::nameUUIDFromBytes`.
>
> Removing the UUID clone cache and running the microbenchmark along with the
> changes in #1933:
>
> Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
> UUIDBench.fromType3Bytes                           2  thrpt   12    2.182 ±  0.090  ops/us
> UUIDBench.fromType3Bytes:·gc.alloc.rate            2  thrpt   12  439.020 ± 18.241  MB/sec
> UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  264.022 ±  0.003  B/op
>
> The goal now is to simplify the digest code and compare alternatives.

I've run various tests and concluded that the `VarHandle`-based code matches
or improves upon the `Unsafe`-riddled code in `ByteArrayAccess`. I then went
ahead and consolidated `ByteArrayAccess` to use a similar code pattern
throughout for consistency, which amounts to a good cleanup.

With MD5 intrinsics disabled, I get this baseline:

Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
UUIDBench.fromType3Bytes                           2  thrpt   12    1.245 ±  0.077  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  488.042 ±  0.004  B/op

With the current patch here (not including #1933):

Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
UUIDBench.fromType3Bytes                           2  thrpt   12    1.431 ±  0.106  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  408.035 ±  0.006  B/op

If I isolate the `ByteArrayAccess` changes I'm getting performance-neutral or
slightly better numbers compared to baseline for these tests:

Benchmark                                     (size)   Mode  Cnt    Score    Error  Units
UUIDBench.fromType3Bytes                           2  thrpt   12    1.317 ±  0.092  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm       2  thrpt   12  488.042 ±  0.004  B/op
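
For readers who want to sanity-check the deltas between the baseline and patched runs above, here is a small arithmetic sketch. The class and method names are made up for illustration, and the 16-byte array header is an assumption (typical for 64-bit HotSpot with compressed class pointers), not something stated in this thread.

```java
public class BenchDeltaSketch {
    // Relative throughput change in percent.
    static double speedupPercent(double baseline, double patched) {
        return (patched / baseline - 1.0) * 100.0;
    }

    // Estimated heap footprint of an int[]. headerBytes is an assumption:
    // 16 is typical for 64-bit HotSpot with compressed class pointers.
    static long intArrayBytes(int length, int headerBytes) {
        long raw = headerBytes + 4L * length;
        return (raw + 7) / 8 * 8; // round up to 8-byte object alignment
    }

    public static void main(String[] args) {
        // Scores taken from the baseline and patched tables in this message.
        System.out.printf("throughput: %+.1f%%%n", speedupPercent(1.245, 1.431));
        System.out.printf("allocation: %.0f B/op fewer%n", 488.042 - 408.035);
        System.out.println("int[16] estimate: " + intArrayBytes(16, 16) + " bytes");
    }
}
```

Under that header-size assumption, the roughly 80 B/op saved matches the estimated footprint of one 16-element `int[]`, which is consistent with the per-digest `int[] x` no longer being allocated; the throughput gain works out to about 15%.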

-

PR: https://git.openjdk.java.net/jdk/pull/1855


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-08 Thread Claes Redestad
On Tue, 5 Jan 2021 23:08:43 GMT, DellCliff wrote:

>> Since `java.util.UUID` and `sun.security.provider.MD5` are both in 
>> `java.base`, would it make sense to create new instances by calling `new 
>> MD5()` instead of `java.security.MessageDigest.getInstance("MD5")` and 
>> bypassing the whole MessageDigest logic?
>
> Are you sure you're not ending up paying more using a VarHandle and having to 
> cast and using a var args call `(long) LONG_ARRAY_HANDLE.get(buf, ofs);` 
> instead of creating a ByteBuffer once via 
> `ByteBuffer.wrap(buffer).order(ByteOrder.nativeOrder()).asLongBuffer()`?

Hitting up `new MD5()` directly could be a great idea. I expect this would be 
just as fast as the cache+clone (if not faster), but I'm a bit worried we'd be 
short-circuiting the ability to install an alternative MD5 provider (which may 
or may not be a thing we must support..), but it's worth exploring.

Comparing performance of this against a `ByteBuffer` impl is on my TODO. The 
`VarHandle` gets heavily inlined and optimized here, though, with performance 
in my tests similar to the `Unsafe` use in `ByteArrayAccess`.
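
To make the two access styles under discussion concrete, here is a minimal sketch, not the actual `ByteArrayAccess` code. The class and method names are invented for illustration, and it fixes the byte order to little-endian (rather than the `ByteOrder.nativeOrder()` mentioned in the question) so the result is the same on every platform.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteAccessSketch {
    // Views a byte[] as little-endian longs; the index passed to get() is a byte offset.
    private static final VarHandle LONG_ARRAY_HANDLE =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // VarHandle style: signature-polymorphic call, result cast to long.
    static long readWithVarHandle(byte[] buf, int ofs) {
        return (long) LONG_ARRAY_HANDLE.get(buf, ofs);
    }

    // ByteBuffer style: wrap the array, then do an absolute read at a byte offset.
    static long readWithByteBuffer(byte[] buf, int ofs) {
        return ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getLong(ofs);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        for (int i = 0; i < buf.length; i++) {
            buf[i] = (byte) (i + 1);
        }
        long a = readWithVarHandle(buf, 8);
        long b = readWithByteBuffer(buf, 8);
        System.out.println(Long.toHexString(a) + " same=" + (a == b));
    }
}
```

Both methods return the same value for the same offset; which one is cheaper after JIT compilation is exactly the open question here and would need to be measured.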

-

PR: https://git.openjdk.java.net/jdk/pull/1855


Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-08 Thread Claes Redestad
On Wed, 6 Jan 2021 00:41:29 GMT, Claes Redestad  wrote:

>> Are you sure you're not ending up paying more using a VarHandle and having 
>> to cast and using a var args call `(long) LONG_ARRAY_HANDLE.get(buf, ofs);` 
>> instead of creating a ByteBuffer once via 
>> `ByteBuffer.wrap(buffer).order(ByteOrder.nativeOrder()).asLongBuffer()`?
>
> Hitting up `new MD5()` directly could be a great idea. I expect this would be 
> just as fast as the cache+clone (if not faster), but I'm a bit worried we'd 
> be short-circuiting the ability to install an alternative MD5 provider (which 
> may or may not be a thing we must support..), but it's worth exploring.
> 
> Comparing performance of this against a `ByteBuffer` impl is on my TODO. The 
> `VarHandle` gets heavily inlined and optimized here, though, with performance 
> in my tests similar to the `Unsafe` use in `ByteArrayAccess`.

I've identified a number of optimizations to the plumbing behind
`MessageDigest.getInstance(..)` over in #1933 that remove 80-90% of the
throughput overhead and all the allocation overhead compared to the `clone()`
approach prototyped here. The remaining 20ns/op overhead might not be enough of
a concern to warrant a point fix in `UUID::nameUUIDFromBytes`.
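
For context on what that code path does, here is a rough sketch of a type 3 (name-based) UUID computed through the public `MessageDigest.getInstance("MD5")` lookup. This is an illustration written against public APIs, not the JDK source; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class Type3UuidSketch {
    // Builds a version 3 UUID from the MD5 digest of the given name bytes.
    static UUID type3(byte[] name) {
        MessageDigest md;
        try {
            // Provider lookup: this is the plumbing whose overhead is discussed above.
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] d = md.digest(name);
        d[6] &= 0x0f;  // clear the version nibble
        d[6] |= 0x30;  // set version 3
        d[8] &= 0x3f;  // clear the variant bits
        d[8] |= 0x80;  // set the IETF variant
        ByteBuffer bb = ByteBuffer.wrap(d); // big-endian by default
        return new UUID(bb.getLong(), bb.getLong());
    }

    public static void main(String[] args) {
        byte[] name = "example".getBytes(StandardCharsets.UTF_8);
        System.out.println(type3(name).equals(UUID.nameUUIDFromBytes(name)));
    }
}
```

Every call pays for a provider lookup and a fresh digest instance, which is why the lookup cost shows up so clearly in a microbenchmark with tiny inputs.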

-

PR: https://git.openjdk.java.net/jdk/pull/1855


RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-08 Thread Claes Redestad
- The MD5 intrinsics added by 
[JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) show that the 
`int[] x` isn't actually needed. This also applies to the SHA intrinsics from 
which the MD5 intrinsic takes inspiration.
- Using VarHandles we can simplify the code in `ByteArrayAccess` enough to make 
it acceptable to use inline and replace the array in MD5 wholesale. This 
improves performance both in the presence and the absence of the intrinsic 
optimization.
- Doing the exact same thing in the SHA impls would be unwieldy (64+ element 
arrays), but allocating the array lazily gets most of the speed-up in the 
presence of an intrinsic while being neutral in its absence.
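
The lazy-allocation idea in the last bullet can be sketched roughly as follows. This is a toy stand-in, not the SHA implementation: the class, the `fastPath` flag (which simulates an intrinsic taking over the call) and the byte-sum "compression" are all hypothetical and only illustrate when the scratch array gets allocated.

```java
public class LazyScratchSketch {
    // Scratch array used only by the pure-Java path; null until first needed.
    private int[] scratch;

    // Sums the unsigned bytes of a 64-byte block. When fastPath is true
    // (standing in for an intrinsic), the scratch array is never touched.
    int compress(byte[] block, boolean fastPath) {
        if (block.length != 64) {
            throw new IllegalArgumentException("expected a 64-byte block");
        }
        if (fastPath) {
            int sum = 0;
            for (byte b : block) {
                sum += b & 0xff;
            }
            return sum;
        }
        if (scratch == null) {
            scratch = new int[64]; // allocated lazily, then reused across calls
        }
        for (int i = 0; i < 64; i++) {
            scratch[i] = block[i] & 0xff;
        }
        int sum = 0;
        for (int v : scratch) {
            sum += v;
        }
        return sum;
    }

    boolean scratchAllocated() {
        return scratch != null;
    }

    public static void main(String[] args) {
        LazyScratchSketch s = new LazyScratchSketch();
        byte[] block = new byte[64];
        block[0] = 1;
        block[63] = (byte) 0xff;
        System.out.println(s.compress(block, true) + " allocated=" + s.scratchAllocated());
        System.out.println(s.compress(block, false) + " allocated=" + s.scratchAllocated());
    }
}
```

An instance that only ever takes the fast path never pays for the array, while the fallback path allocates it once and reuses it, which is the "neutral in its absence" property described above.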

Baseline:
Benchmark                    (digesterName)  (length)  Cnt     Score     Error  Units
MessageDigests.digest                   MD5        16   15  2714.307 ±  21.133  ops/ms
MessageDigests.digest                   MD5      1024   15   318.087 ±   0.637  ops/ms
MessageDigests.digest                 SHA-1        16   15  1387.266 ±  40.932  ops/ms
MessageDigests.digest                 SHA-1      1024   15   109.273 ±   0.149  ops/ms
MessageDigests.digest               SHA-256        16   15   995.566 ±  21.186  ops/ms
MessageDigests.digest               SHA-256      1024   15    89.104 ±   0.079  ops/ms
MessageDigests.digest               SHA-512        16   15   803.030 ±  15.722  ops/ms
MessageDigests.digest               SHA-512      1024   15   115.611 ±   0.234  ops/ms
MessageDigests.getAndDigest             MD5        16   15  2190.367 ±  97.037  ops/ms
MessageDigests.getAndDigest             MD5      1024   15   302.903 ±   1.809  ops/ms
MessageDigests.getAndDigest           SHA-1        16   15  1262.656 ±  43.751  ops/ms
MessageDigests.getAndDigest           SHA-1      1024   15   104.889 ±   3.554  ops/ms
MessageDigests.getAndDigest         SHA-256        16   15   914.541 ±  55.621  ops/ms
MessageDigests.getAndDigest         SHA-256      1024   15    85.708 ±   1.394  ops/ms
MessageDigests.getAndDigest         SHA-512        16   15   737.719 ±  53.671  ops/ms
MessageDigests.getAndDigest         SHA-512      1024   15   112.307 ±   1.950  ops/ms

GC:
MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5  16  15   312.011 ± 0.005  B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1  16  15   584.020 ± 0.006  B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256  16  15   544.019 ± 0.016  B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512  16  15  1056.037 ± 0.003  B/op

Target:
Benchmark                    (digesterName)  (length)  Cnt     Score     Error  Units
MessageDigests.digest                   MD5        16   15  3134.462 ±  43.685  ops/ms
MessageDigests.digest                   MD5      1024   15   323.667 ±   0.633  ops/ms
MessageDigests.digest                 SHA-1        16   15  1418.742 ±  38.223  ops/ms
MessageDigests.digest                 SHA-1      1024   15   110.178 ±   0.788  ops/ms
MessageDigests.digest               SHA-256        16   15  1037.949 ±  21.214  ops/ms
MessageDigests.digest               SHA-256      1024   15    89.671 ±   0.228  ops/ms
MessageDigests.digest               SHA-512        16   15   812.028 ±  39.489  ops/ms
MessageDigests.digest               SHA-512      1024   15   116.738 ±   0.249  ops/ms
MessageDigests.getAndDigest             MD5        16   15  2314.379 ± 229.294  ops/ms
MessageDigests.getAndDigest             MD5      1024   15   307.835 ±   5.730  ops/ms
MessageDigests.getAndDigest           SHA-1        16   15  1326.887 ±  63.263  ops/ms
MessageDigests.getAndDigest           SHA-1      1024   15   106.611 ±   2.292  ops/ms
MessageDigests.getAndDigest         SHA-256        16   15   961.589 ±  82.052  ops/ms
MessageDigests.getAndDigest         SHA-256      1024   15    88.646 ±   0.194  ops/ms
MessageDigests.getAndDigest         SHA-512        16   15   775.417 ±  56.775  ops/ms
MessageDigests.getAndDigest         SHA-512      1024   15   112.904 ±   2.014  ops/ms

GC:
MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5  16  15   232.009 ± 0.006  B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1  16  15   584.021 ± 0.001  B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256  16  15   272.012 ± 0.015  B/op

Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-08 Thread DellCliff
On Tue, 5 Jan 2021 21:51:51 GMT, DellCliff wrote:

>> - The MD5 intrinsics added by 
>> [JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) show that 
>> the `int[] x` isn't actually needed. This also applies to the SHA intrinsics 
>> from which the MD5 intrinsic takes inspiration.
>> - Using VarHandles we can simplify the code in `ByteArrayAccess` enough to 
>> make it acceptable to use inline and replace the array in MD5 wholesale. 
>> This improves performance both in the presence and the absence of the 
>> intrinsic optimization.
>> - Doing the exact same thing in the SHA impls would be unwieldy (64+ element 
>> arrays), but allocating the array lazily gets most of the speed-up in the 
>> presence of an intrinsic while being neutral in its absence.
>> 
>> Baseline:
>> Benchmark                    (digesterName)  (length)  Cnt     Score     Error  Units
>> MessageDigests.digest                   MD5        16   15  2714.307 ±  21.133  ops/ms
>> MessageDigests.digest                   MD5      1024   15   318.087 ±   0.637  ops/ms
>> MessageDigests.digest                 SHA-1        16   15  1387.266 ±  40.932  ops/ms
>> MessageDigests.digest                 SHA-1      1024   15   109.273 ±   0.149  ops/ms
>> MessageDigests.digest               SHA-256        16   15   995.566 ±  21.186  ops/ms
>> MessageDigests.digest               SHA-256      1024   15    89.104 ±   0.079  ops/ms
>> MessageDigests.digest               SHA-512        16   15   803.030 ±  15.722  ops/ms
>> MessageDigests.digest               SHA-512      1024   15   115.611 ±   0.234  ops/ms
>> MessageDigests.getAndDigest             MD5        16   15  2190.367 ±  97.037  ops/ms
>> MessageDigests.getAndDigest             MD5      1024   15   302.903 ±   1.809  ops/ms
>> MessageDigests.getAndDigest           SHA-1        16   15  1262.656 ±  43.751  ops/ms
>> MessageDigests.getAndDigest           SHA-1      1024   15   104.889 ±   3.554  ops/ms
>> MessageDigests.getAndDigest         SHA-256        16   15   914.541 ±  55.621  ops/ms
>> MessageDigests.getAndDigest         SHA-256      1024   15    85.708 ±   1.394  ops/ms
>> MessageDigests.getAndDigest         SHA-512        16   15   737.719 ±  53.671  ops/ms
>> MessageDigests.getAndDigest         SHA-512      1024   15   112.307 ±   1.950  ops/ms
>>
>> GC:
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5  16  15   312.011 ± 0.005  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1  16  15   584.020 ± 0.006  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256  16  15   544.019 ± 0.016  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512  16  15  1056.037 ± 0.003  B/op
>> 
>> Target:
>> Benchmark                    (digesterName)  (length)  Cnt     Score     Error  Units
>> MessageDigests.digest                   MD5        16   15  3134.462 ±  43.685  ops/ms
>> MessageDigests.digest                   MD5      1024   15   323.667 ±   0.633  ops/ms
>> MessageDigests.digest                 SHA-1        16   15  1418.742 ±  38.223  ops/ms
>> MessageDigests.digest                 SHA-1      1024   15   110.178 ±   0.788  ops/ms
>> MessageDigests.digest               SHA-256        16   15  1037.949 ±  21.214  ops/ms
>> MessageDigests.digest               SHA-256      1024   15    89.671 ±   0.228  ops/ms
>> MessageDigests.digest               SHA-512        16   15   812.028 ±  39.489  ops/ms
>> MessageDigests.digest               SHA-512      1024   15   116.738 ±   0.249  ops/ms
>> MessageDigests.getAndDigest             MD5        16   15  2314.379 ± 229.294  ops/ms
>> MessageDigests.getAndDigest             MD5      1024   15   307.835 ±   5.730  ops/ms
>> MessageDigests.getAndDigest           SHA-1        16   15  1326.887 ±  63.263  ops/ms
>> MessageDigests.getAndDigest           SHA-1      1024   15   106.611 ±   2.292  ops/ms
>> MessageDigests.getAndDigest         SHA-256        16   15   961.589 ±  82.052  ops/ms
>> MessageDigests.getAndDigest         SHA-256      1024   15    88.646 ±   0.194  ops/ms
>> MessageDigests.getAndDigest         SHA-512        16   15   775.417 ±  56.775  ops/ms
>> MessageDigests.getAndDigest         SHA-512      1024   15   112.904 ±   2.014  ops/ms
>>
>> GC

Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

2021-01-08 Thread DellCliff
On Sun, 20 Dec 2020 20:27:03 GMT, Claes Redestad  wrote:

> - The MD5 intrinsics added by 
> [JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) show that 
> the `int[] x` isn't actually needed. This also applies to the SHA intrinsics 
> from which the MD5 intrinsic takes inspiration.
> - Using VarHandles we can simplify the code in `ByteArrayAccess` enough to 
> make it acceptable to use inline and replace the array in MD5 wholesale. This 
> improves performance both in the presence and the absence of the intrinsic 
> optimization.
> - Doing the exact same thing in the SHA impls would be unwieldy (64+ element 
> arrays), but allocating the array lazily gets most of the speed-up in the 
> presence of an intrinsic while being neutral in its absence.
> 
> Baseline:
> Benchmark                    (digesterName)  (length)  Cnt     Score     Error  Units
> MessageDigests.digest                   MD5        16   15  2714.307 ±  21.133  ops/ms
> MessageDigests.digest                   MD5      1024   15   318.087 ±   0.637  ops/ms
> MessageDigests.digest                 SHA-1        16   15  1387.266 ±  40.932  ops/ms
> MessageDigests.digest                 SHA-1      1024   15   109.273 ±   0.149  ops/ms
> MessageDigests.digest               SHA-256        16   15   995.566 ±  21.186  ops/ms
> MessageDigests.digest               SHA-256      1024   15    89.104 ±   0.079  ops/ms
> MessageDigests.digest               SHA-512        16   15   803.030 ±  15.722  ops/ms
> MessageDigests.digest               SHA-512      1024   15   115.611 ±   0.234  ops/ms
> MessageDigests.getAndDigest             MD5        16   15  2190.367 ±  97.037  ops/ms
> MessageDigests.getAndDigest             MD5      1024   15   302.903 ±   1.809  ops/ms
> MessageDigests.getAndDigest           SHA-1        16   15  1262.656 ±  43.751  ops/ms
> MessageDigests.getAndDigest           SHA-1      1024   15   104.889 ±   3.554  ops/ms
> MessageDigests.getAndDigest         SHA-256        16   15   914.541 ±  55.621  ops/ms
> MessageDigests.getAndDigest         SHA-256      1024   15    85.708 ±   1.394  ops/ms
> MessageDigests.getAndDigest         SHA-512        16   15   737.719 ±  53.671  ops/ms
> MessageDigests.getAndDigest         SHA-512      1024   15   112.307 ±   1.950  ops/ms
>
> GC:
> MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5  16  15   312.011 ± 0.005  B/op
> MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1  16  15   584.020 ± 0.006  B/op
> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256  16  15   544.019 ± 0.016  B/op
> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512  16  15  1056.037 ± 0.003  B/op
> 
> Target:
> Benchmark                    (digesterName)  (length)  Cnt     Score     Error  Units
> MessageDigests.digest                   MD5        16   15  3134.462 ±  43.685  ops/ms
> MessageDigests.digest                   MD5      1024   15   323.667 ±   0.633  ops/ms
> MessageDigests.digest                 SHA-1        16   15  1418.742 ±  38.223  ops/ms
> MessageDigests.digest                 SHA-1      1024   15   110.178 ±   0.788  ops/ms
> MessageDigests.digest               SHA-256        16   15  1037.949 ±  21.214  ops/ms
> MessageDigests.digest               SHA-256      1024   15    89.671 ±   0.228  ops/ms
> MessageDigests.digest               SHA-512        16   15   812.028 ±  39.489  ops/ms
> MessageDigests.digest               SHA-512      1024   15   116.738 ±   0.249  ops/ms
> MessageDigests.getAndDigest             MD5        16   15  2314.379 ± 229.294  ops/ms
> MessageDigests.getAndDigest             MD5      1024   15   307.835 ±   5.730  ops/ms
> MessageDigests.getAndDigest           SHA-1        16   15  1326.887 ±  63.263  ops/ms
> MessageDigests.getAndDigest           SHA-1      1024   15   106.611 ±   2.292  ops/ms
> MessageDigests.getAndDigest         SHA-256        16   15   961.589 ±  82.052  ops/ms
> MessageDigests.getAndDigest         SHA-256      1024   15    88.646 ±   0.194  ops/ms
> MessageDigests.getAndDigest         SHA-512        16   15   775.417 ±  56.775  ops/ms
> MessageDigests.getAndDigest         SHA-512      1024   15   112.904 ±   2.014  ops/ms
>
> GC
> MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5  16  15   232.009 ±