Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v8]

Andrew Dinn Mon, 22 Jun 2026 02:40:25 -0700

On Fri, 19 Jun 2026 17:48:50 GMT, Sergey Bylokhov <[email protected]> wrote:


>> Thanks @adinn and @theRealAph! Could one of you also sponsor it?
>
> Hi @ferakocz, did you have a chance to run 
> test/micro/org/openjdk/bench/javax/crypto/full/PolynomialP256Bench.java on 
> this patch?
>>.../build/patched/images/jdk/bin/java -jar .../build/patched/images/test/micro/benchmarks.jar org.openjdk.bench.javax.crypto.full.PolynomialP256Bench.benchAssign -p isMontBench=true
> 
> I got these numbers on my local laptop macOS on m4:
> Patched:
>>PolynomialP256Bench.benchAssign           true  thrpt    8  10230.113 ± 146.263  ops/s
> 
> Baseline:
>>PolynomialP256Bench.benchAssign           true  thrpt    8  23548.039 ± 1596.303  ops/s

@mrserb 

I am also seeing a slowdown for this specific micro-benchmark on a Fedora M2 
Mac:

Baseline:

>    PolynomialP256Bench.benchAssign true  thrpt    8  14774.689 ± 1764.136  ops/s

Patched:

>    PolynomialP256Bench.benchAssign true  thrpt    8  8171.365 ± 135.887  ops/s

The benchMultiply and benchSquare micro-benchmarks both show an improvement:

Baseline:

>    PolynomialP256Bench.benchMultiply           true  thrpt    8  2624.022 ± 1.985  ops/s
>    PolynomialP256Bench.benchSquare             true  thrpt    8  2629.698 ± 3.645  ops/s


Patched:

>    PolynomialP256Bench.benchMultiply           true  thrpt    8  3200.923 ± 3.748  ops/s
>    PolynomialP256Bench.benchSquare             true  thrpt    8  3203.488 ± 3.074  ops/s
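
To put the M2 figures above in relative terms, the percentage change can be computed directly from the reported means. This is only a rough sketch: the helper name `relative_change` is illustrative, and it ignores the error bars, which are large for the benchAssign baseline.

```python
def relative_change(baseline: float, patched: float) -> float:
    """Percentage change of patched throughput relative to baseline."""
    return (patched - baseline) / baseline * 100.0


# Mean ops/s as (baseline, patched), copied from the M2 runs above.
# Error bars are deliberately ignored in this sketch.
m2_results = {
    "benchAssign": (14774.689, 8171.365),
    "benchMultiply": (2624.022, 3200.923),
    "benchSquare": (2629.698, 3203.488),
}

for name, (baseline, patched) in m2_results.items():
    print(f"{name}: {relative_change(baseline, patched):+.1f}%")
```

On these numbers benchAssign drops by roughly 45%, while benchMultiply and benchSquare each gain about 22%.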

@ferakocz

I'm not sure we should automatically trust this benchmark run in isolation -- 
it is most important to gauge what effect the use of the multiply and assign 
intrinsics has when exercising the P256 API. The micro-benchmark result does 
suggest that the intrinsification of conditionalAssign may not always help on 
AArch64. However, it might still be the case that when employed in combination 
with the multiply intrinsic it is of benefit - possibly also depending on what 
hardware we are running on.

Your API/method level testing showed an improvement of 9% at the method level 
and 5% at the API level. Have you also run these tests on your M1 machine with 
the intrinsic for conditionalAssign omitted? If so, what was the effect? If not, 
could you do so and let us know what difference it makes?

If you provide details of the tests run and how to exercise them, I will happily 
check what the effect is on my M2 box if I disable generation of the 
conditionalAssign intrinsic. Perhaps @mrserb can do the same on his M4 Mac. 
Depending on the outcome, we might also want to check this on other AArch64 CPUs.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30941#issuecomment-4766910504
