Integrated: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic
On Tue, 2 Apr 2024 15:42:05 GMT, Volodymyr Paprotski wrote: > Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... This pull request has now been integrated. Changeset: afed7d0b Author:Volodymyr Paprotski Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/afed7d0b0593864e5595840a6b645c210ff28c7c Stats: 2409 lines in 36 files changed: 2093 ins; 156 del; 160 mod 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic Reviewed-by: ihse, ascarpino, sviswanathan - PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]
On Tue, 21 May 2024 17:41:46 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request with a new target base due > to a merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 17 additional > commits since the last revision: > > - Merge remote-tracking branch 'origin/master' into ecc-montgomery > - shenandoah verifier > - comments from Sandhya > - whitespace > - add message back > - whitespace > - Use AffinePoint to exit Montgomery domain > >Style notes: >Affine.equals() >- Mismatched fields only appear to be used from testing, perhaps > should be moved there instead >Affine.getX(boolean)|getY(boolean) >- "Passing flag is bad design" - cleanest/performant alternative to > several instanceof checks >- needed to convert Affine to Projective (need to stay in montgomery > domain) >ECOperations.PointMultiplier > - changes could probably be restored to original (since ProjectivePoint > handling no longer required) > - consider these changes an improvement? (fewer nested classes) > - was an inner-class but not using inner-class features (i.e. ecOps > variable should be converted) > - whitespace > - Comments from Tony and Jatin > - Comments from Jatin and Tony > - ... and 7 more: https://git.openjdk.org/jdk/compare/c0032e2c...b1a33004 Thanks Tobi! - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2124924526
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: - Merge remote-tracking branch 'origin/master' into ecc-montgomery - shenandoah verifier - comments from Sandhya - whitespace - add message back - whitespace - Use AffinePoint to exit Montgomery domain Style notes: Affine.equals() - Mismatched fields only appear to be used from testing, perhaps should be moved there instead Affine.getX(boolean)|getY(boolean) - "Passing flag is bad design" - cleanest/performant alternative to several instanceof checks - needed to convert Affine to Projective (need to stay in montgomery domain) ECOperations.PointMultiplier - changes could probably be restored to original (since ProjectivePoint handling no longer required) - consider these changes an improvement? (fewer nested classes) - was an inner-class but not using inner-class features (i.e. ecOps variable should be converted) - whitespace - Comments from Tony and Jatin - Comments from Jatin and Tony - ... and 7 more: https://git.openjdk.org/jdk/compare/12e8009b...b1a33004 - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/df4fe6fa..b1a33004 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=11 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=10-11 Stats: 190975 lines in 3949 files changed: 105304 ins; 64688 del; 20983 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]
On Tue, 21 May 2024 07:21:14 GMT, Tobias Hartmann wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> shenandoah verifier > > I'm getting some conflicts when trying to apply this to master. Could you > please merge the PR? Hi @TobiHartmann , merged with no issues for me. Could you please run the tests again? (I think Tony did run them, but can't hurt verifying again). Thanks! - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2123122468
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]
On Fri, 17 May 2024 21:16:47 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > shenandoah verifier Thanks Sandhya! Now that I have @ascarpino approval as well, I plan to integrate next Tuesday. - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2118443577
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: shenandoah verifier - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/5c360e35..df4fe6fa Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=10 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=09-10 Stats: 7 lines in 2 files changed: 6 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
On Thu, 16 May 2024 23:21:36 GMT, Sandhya Viswanathan wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> whitespace > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168: > >> 166: XMMRegister broadcast5 = xmm24; >> 167: KRegister limb0 = k1; >> 168: KRegister limb5 = k2; > > limb5 and select are not being used anymore. Thanks, fixed (and also broadcast5) > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185: > >> 183: __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), >> false, Assembler::AVX_512bit, rscratch); >> 184: >> 185: // A = load(*aLimbs) > > A little bit more description in comments on what the load step involves > would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, > or in the lowest limb. Done > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270: > >> 268: __ push(r14); >> 269: __ push(r15); >> 270: > > No need to save/restore rbx, r12, r14, r15. Only r13 is used as temp in > montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed > to r8. Seems I forgot to completely cleanup, thanks! (Originally copied from poly1305 stub) > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286: > >> 284: __ mov(aLimbs, c_rarg0); >> 285: __ mov(bLimbs, c_rarg1); >> 286: __ mov(rLimbs, c_rarg2); > > We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then > these moves are not necessary. Gave them symbolic names and passed the gpr temp and parameter. vector register map still in the montgomeryMultiply function, but gprs explicitly passed in. 'close enough'? > src/hotspot/cpu/x86/vm_version_x86.cpp line 1370: > >> 1368: >> 1369: #ifdef _LP64 >> 1370: if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize >> >= 64) { > > No need to tie the intrinsic to MaxVectorSize setting. Done > src/hotspot/share/opto/library_call.cpp line 7564: > >> 7562: >> 7563: if (!stubAddr) return false; >> 7564: if (stopped()) return true; > > Line 7564 seems redundant here as there is no range check or anything like > that before this. Oh. That is what that is for... I thought it was some soft of 'VM quitting' short-circuit. Removed. - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328906 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328960 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328859 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328829 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605329040 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328995
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v10]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: comments from Sandhya - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/8cd095dd..5c360e35 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=09 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=08-09 Stats: 82 lines in 4 files changed: 1 ins; 59 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: whitespace - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/83b21310..8cd095dd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=08 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=07-08 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v7]
On Thu, 9 May 2024 23:36:03 GMT, Anthony Scarpino wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> whitespace > > src/java.base/share/classes/sun/security/ec/ECOperations.java line 701: > >> 699: if (!m.equals(v)) { >> 700: java.util.HexFormat hex = >> java.util.HexFormat.of(); >> 701: throw new RuntimeException(); > > I think your cleanup went to far. You should have some message saying they > are not equal and if you don't want to print hex, remove getting an instance. I put the message back.. I removed it 'half'-intentionally; Was comparing against the original version and it didn't have any details, thought maybe should follow suit. But I did find this message helpful, so its back. - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1596116606
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v8]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: add message back - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/1ecfdc44..83b21310 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=07 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=06-07 Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v7]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: whitespace - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/8ff243a2..1ecfdc44 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=06 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=05-06 Stats: 753 lines in 9 files changed: 303 ins; 101 del; 349 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v6]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: Use AffinePoint to exit Montgomery domain Style notes: Affine.equals() - Mismatched fields only appear to be used from testing, perhaps should be moved there instead Affine.getX(boolean)|getY(boolean) - "Passing flag is bad design" - cleanest/performant alternative to several instanceof checks - needed to convert Affine to Projective (need to stay in montgomery domain) ECOperations.PointMultiplier - changes could probably be restored to original (since ProjectivePoint handling no longer required) - consider these changes an improvement? (fewer nested classes) - was an inner-class but not using inner-class features (i.e. ecOps variable should be converted) - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/a1984501..8ff243a2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=05 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=04-05 Stats: 268 lines in 7 files changed: 89 ins; 147 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v5]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: whitespace - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/c93a71f0..a1984501 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=04 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=03-04 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v3]
On Tue, 23 Apr 2024 19:55:57 GMT, Anthony Scarpino wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Comments from Jatin and Tony > > src/java.base/share/classes/sun/security/ec/ECOperations.java line 204: > >> 202: * @return the product >> 203: */ >> 204: public MutablePoint multiply(AffinePoint affineP, byte[] s) { > > It seems like there could be some combining of both `multiply()`. If > `multiply(AffinePoint, ...)` is called, it can call `DefaultMultiplier` with > the `affineP`, but internally call the other `multiply(ECPoint, ...)` for the > other situations. I'd rather not have two methods doing most of the same > code, but different methods. Thanks, they indeed look identical, didnt notice. Fixed. (repeated the same hashmap refactoring and didnt notice I produced identical code twice) > src/java.base/share/classes/sun/security/ec/ECOperations.java line 467: > >> 465: sealed static abstract class SmallWindowMultiplier implements >> PointMultiplier >> 466: permits DefaultMultiplier, DefaultMontgomeryMultiplier { >> 467: private final AffinePoint affineP; > > I don't think `affineP` needs to be a class variable anymore. It's only used > in the constructor Didn't notice, thanks, fixed. > src/java.base/share/classes/sun/security/ec/ECOperations.java line 592: > >> 590: } >> 591: >> 592: private final ProjectivePoint.Immutable[][] points; > > Can you define this at the top please. Done > src/java.base/share/classes/sun/security/ec/ECOperations.java line 668: > >> 666: } >> 667: >> 668: private final BigInteger[] base; > > Can you define this at the top. You use it in the constructor but it's > defined later on. Done - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578117929 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578147190 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578148562 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578150303
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 9 Apr 2024 02:01:36 GMT, Anthony Scarpino wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> remove use of jdk.crypto.ec > > src/java.base/share/classes/sun/security/ec/ECOperations.java line 308: > >> 306: >> 307: /* >> 308: * public Point addition. Used by ECDSAOperations > > Was the old description not applicable anymore? It would be nice to improve > on the existing description that shortening it. Forgot to go back and fix the comment. Fixed.. As for the 'meaning'. Notice the signature of the function changed (i.e. no longer a 'mixed point', but two ProjectivePoints. This is a good idea regardless of Montgomery, but it affects montgomery particularly badly (need to compute zInv for 'no reason'. ) For sake of completeness. Apart from constructor, the 'API' for ECOperations (i.e. as used by ECDHE, ECDSAOperations and KeyGeneration) are these three functions (everything else is used internally by this class) public void setSum(MutablePoint p, MutablePoint p2) public MutablePoint multiply(AffinePoint affineP, byte[] s) public MutablePoint multiply(ECPoint ecPoint, byte[] s) > src/java.base/share/classes/sun/security/ec/ECOperations.java line 321: > >> 319: ECOperations ops = this; >> 320: if (this.montgomeryOps != null) { >> 321: assert p.getField() instanceof >> IntegerMontgomeryFieldModuloP; > > This should throw a ProviderException, I believe this would throw an > AssertionException Missed this comment. No longer applicable (this.montgomeryOps got refactored away) - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578144125 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578161140
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v4]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: Comments from Tony and Jatin - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/6f9ac046..c93a71f0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=03 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=02-03 Stats: 48 lines in 2 files changed: 20 ins; 20 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 16 Apr 2024 02:26:57 GMT, Jatin Bhateja wrote: >> Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add >> alignment and loop rotation, but being a fallback figured its more important >> to keep it small > > It's all part of intrinsic, no harm in polishing it. Done (normalized loop/backedge). There was actually a problem in the loop counter.. (`i-=1` instead of `i-=16`). Can't include a test since classes are sealed, but verified manually. - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578172873
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Fri, 5 Apr 2024 07:19:28 GMT, Jatin Bhateja wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> remove use of jdk.crypto.ec > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39: > >> 37: }; >> 38: static address modulus_p256() { >> 39: return (address)MODULUS_P256; > > Long constants should have UL suffix. Properly ULL, but good point, fixed > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386: > >> 384: __ jcc(Assembler::equal, L_Length19); >> 385: >> 386: // Default copy loop > > Please add appropriate loop entry alignment. This is actually a 'switch statement default'. The default should never happen (See "Known Length comment on line 335"), but added because java code has that behavior. (i.e. in the unlikely case NIST adds a new elliptic curve to the existing standard?) > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394: > >> 392: __ lea(aLimbs, Address(aLimbs,8)); >> 393: __ lea(bLimbs, Address(bLimbs,8)); >> 394: __ jmp(L_DefaultLoop); > > Both sub and cmp are flag affecting instructions and are macro-fusible. > By doing a loop rotation i.e. moving the length <= 0 check outside the loop > and pushing the loop exit check at bottom you can save additional compare > checks. Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add alignment and loop rotation, but being a fallback figured its more important to keep it small - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486768 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486717 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486848
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Thu, 11 Apr 2024 17:15:21 GMT, Anthony Scarpino wrote: >>> In `ECOperations.java`, if I understand this correctly, it is to replace >>> the existing `PointMultiplier` with montgomery-based PointMuliplier. But >>> when I look at the code, I see both are still options. If I read this >>> correctly, it checks for the old `IntegerFieldModuloP`, then looks for the >>> new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. >>> Why doesn't it just replace the old implementation entry in the `fields` >>> Map? Is there a reason to keep it around? >> >> Hmm, thats a good point I haven't fully considered; i.e. (if I read >> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery >> entirely".. that might also help in cleaning a few things up in the >> construction. Maybe even get rid of this nested ECOperations inside >> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the >> ECC stack clearer is positive! >> >> One functional reason that might justify keeping it as-is, is fuzz-testing; >> with the fallback available, I am able to write the included Fuzz tests and >> have them check the values against the existing implementation. While I also >> included a few KAT tests using openssl-generated values, the fuzz tests >> check millions of values and it does add a lot more certainty about >> correctness of this code. >> >> Can it be removed? For the operations that do not involve multiplication >> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the >> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and >> existing IntegerPolynomialP256 is no longer used (I should verify that >> again) and only P256OrderField remains non-montgomery. So removing >> references to IntegerPolynomialP256 in ECOperations should be possible and >> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 >> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also >> worth some thought.. >> >> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but >> it could also be chucked up as part of 'scaffolding' and removed in name of >> code quality? >> >> Thanks @ascarpino >> >> PS: Perhaps there is some middle ground, remove the `ECOperations >> montgomeryOps` nesting, and construct (somehow?? singleton makes most things >> inaccessible..) the reference ECOperations in the fuzz test instead.. not >> sure how yet, but perhaps worth a further thought.. > >> > In `ECOperations.java`, if I understand this correctly, it is to replace >> > the existing `PointMultiplier` with montgomery-based PointMuliplier. But >> > when I look at the code, I see both are still options. If I read this >> > correctly, it checks for the old `IntegerFieldModuloP`, then looks for the >> > new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. >> > Why doesn't it just replace the old implementation entry in the `fields` >> > Map? Is there a reason to keep it around? >> >> Hmm, thats a good point I haven't fully considered; i.e. (if I read >> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery >> entirely".. that might also help in cleaning a few things up in the >> construction. Maybe even get rid of this nested ECOperations inside >> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the >> ECC stack clearer is positive! >> >> One functional reason that might justify keeping it as-is, is fuzz-testing; >> with the fallback available, I am able to write the included Fuzz tests and >> have them check the values against the existing implementation. While I also >> included a few KAT tests using openssl-generated values, the fuzz tests >> check millions of values and it does add a lot more certainty about >> correctness of this code. > > I hadn't looked at your fuzz test until you mentioned it. I see you are > using reflection to change the values. Is that what you mean by "fallback"? > I'm assuming there is no to access the older implementation without > reflection. > >> >> Can it be removed? For the operations that do not involve multiplication >> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the >> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and >> existing IntegerPolynomialP256 is no longer used (I should verify that >> again) and only P256OrderField remains non-montgomery. So removing >> references to IntegerPolynomialP256 in ECOperations should be possible and >> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 >> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also >> worth some thought.. >> >> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but >> it could also be chucked up as part of 'scaffolding' and removed in name of >> code quality? > > I
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Wed, 10 Apr 2024 23:56:52 GMT, Volodymyr Paprotski wrote: > Few early comments. > > Please update the copyright year of all the modified files. > > You can even consider splitting this into two patches, Java side changes in > one and x86 optimized intrinsic in next one. Fixed all copyright years git diff da8a095a19c90e7ee2b45fab9b533a1092887023 | lsdiff -p1 | while read line; do echo $line =; grep Copyright $line | grep -v 2024; done | less Re splitting.. probably too late for that now.. (did consider it initially.. got hard to manage two changes while developing. And easier to justify the change when the entire patch is presented? but yes, far more code to review.. ) - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2057892691
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v3]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: Comments from Jatin and Tony - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/82b6dae7..6f9ac046 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=02 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=01-02 Stats: 97 lines in 20 files changed: 4 ins; 36 del; 57 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Fri, 5 Apr 2024 09:17:18 GMT, Jatin Bhateja wrote: > Few early comments. > > Please update the copyright year of all the modified files. > > You can even consider splitting this into two patches, Java side changes in > one and x86 optimized intrinsic in next one. Thanks Jatin, will fix! - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048618452
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Wed, 10 Apr 2024 17:18:55 GMT, Anthony Scarpino wrote: > In `ECOperations.java`, if I understand this correctly, it is to replace the > existing `PointMultiplier` with montgomery-based PointMuliplier. But when I > look at the code, I see both are still options. If I read this correctly, it > checks for the old `IntegerFieldModuloP`, then looks for the new > `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. Why > doesn't it just replace the old implementation entry in the `fields` Map? Is > there a reason to keep it around? Hmm, thats a good point I haven't fully considered; i.e. (if I read correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery entirely".. that might also help in cleaning a few things up in the construction. Maybe even get rid of this nested ECOperations inside ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the ECC stack clearer is positive! One functional reason that might justify keeping it as-is, is fuzz-testing; with the fallback available, I am able to write the included Fuzz tests and have them check the values against the existing implementation. While I also included a few KAT tests using openssl-generated values, the fuzz tests check millions of values and it does add a lot more certainty about correctness of this code. Can it be removed? For the operations that do not involve multiplication (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and existing IntegerPolynomialP256 is no longer used (I should verify that again) and only P256OrderField remains non-montgomery. So removing references to IntegerPolynomialP256 in ECOperations should be possible and cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some thought.. I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but it could also be chucked up as part of 'scaffolding' and removed in name of code quality? Thanks @ascarpino PS: Perhaps there is some middle ground, remove the `ECOperations montgomeryOps` nesting, and construct (somehow?? singleton makes most things inaccessible..) the reference ECOperations in the fuzz test instead.. not sure how yet, but perhaps worth a further thought.. - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048159645
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > remove use of jdk.crypto.ec @ascarpino Hi Tony, this is the ECC P256 PR we talked about last year, would appreciate your feedback. - PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2040325424
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
On Tue, 2 Apr 2024 16:29:07 GMT, Alan Bateman wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> remove use of jdk.crypto.ec > > src/java.base/share/classes/module-info.java line 265: > >> 263: jdk.jfr, >> 264: jdk.unsupported, >> 265: jdk.crypto.ec; > > jdk.crypto.ec has been hollowed out since JDK 22, the sun.security.ec are in > java.base. So I don't think you need this qualified export. Thanks, fixed. (Started this when `jdk.crypto.ec` was still in use.. missed a few spots during rebase I guess) - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1548460157
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: remove use of jdk.crypto.ec - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/dbe6cd3b..82b6dae7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=01 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=00-01 Stats: 6 lines in 2 files changed: 0 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583
RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic
Performance. Before: Benchmark(algorithm) (dataSize) (keyLength) (provider) Mode Cnt ScoreError Units SignatureBench.ECDSA.signSHA256withECDSA1024 256 thrpt3 6443.934 ± 6.491 ops/s SignatureBench.ECDSA.signSHA256withECDSA 16384 256 thrpt3 6152.979 ± 4.954 ops/s SignatureBench.ECDSA.verify SHA256withECDSA1024 256 thrpt3 1895.410 ± 36.979 ops/s SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 thrpt3 1878.955 ± 45.487 ops/s Benchmark(algorithm) (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt3 1357.810 ± 26.584 ops/s o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt3 1352.119 ± 23.547 ops/s Benchmark (isMontBench) Mode Cnt ScoreError Units PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± 10.970 ops/s Performance, no intrinsic: Benchmark(algorithm) (dataSize) (keyLength) (provider) Mode Cnt Score Error Units SignatureBench.ECDSA.signSHA256withECDSA1024 256 thrpt3 6529.839 ± 42.420 ops/s SignatureBench.ECDSA.signSHA256withECDSA 16384 256 thrpt3 6199.747 ± 133.566 ops/s SignatureBench.ECDSA.verify SHA256withECDSA1024 256 thrpt3 1973.676 ± 54.071 ops/s SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 thrpt3 1932.127 ± 35.920 ops/s Benchmark(algorithm) (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt3 1355.788 ± 29.858 ops/s o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt3 1346.523 ± 28.722 ops/s Benchmark (isMontBench) Mode Cnt ScoreError Units PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± 10.591 ops/s Performance, **with intrinsics** Benchmark(algorithm) (dataSize) (keyLength) (provider) Mode Cnt Score Error Units SignatureBench.ECDSA.signSHA256withECDSA1024 256 thrpt3 10384.591 ± 65.274 ops/s SignatureBench.ECDSA.signSHA256withECDSA 16384 256 thrpt3 9592.912 ± 236.411 ops/s SignatureBench.ECDSA.verify SHA256withECDSA1024 256 thrpt3 3479.494 ± 44.578 ops/s SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 thrpt3 3402.147 ± 26.772 ops/s Benchmark(algorithm) (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt3 2527.678 ± 64.791 ops/s o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 EC thrpt3 2541.258 ± 66.634 ops/s Benchmark (isMontBench) Mode Cnt ScoreError Units PolynomialP256Bench.benchMultiply true thrpt3 3021.139 ± 98.289 ops/s Summary on design (see code for 'ASCII art', references and details on math): - Added a new `IntegerPolynomial` field (`MontgomeryIntegerPolynomialP256`) with 52-bit limbs - `getElement(*)/fromMontgomery()` to convert numbers into/out of the field - `ECOperations` is the primary use of the new field - flattened some extra deep nested class hierarchy (also in prep for further other field optimizations) - `forParameters()/multiply()/setSum()` generates numbers in the new field - `ProjectivePoint/Montgomery{Imm|M}utable.asAffine()` to convert out of the new field - Added Fuzz Testing and KAT verified with OpenSSL - Commit messages: - remove trailing whitespace - Remeasure performance - Fix rebase typo - Address comments from Anas and thorough cleanup - conditionalAssign intrinsic - rebase Changes: https://git.openjdk.org/jdk/pull/18583/files Webrev: https://webrevs.openjdk.org/?repo=jdk=18583=00 Issue: https://bugs.openjdk.org/browse/JDK-8329538 Stats: 2335 lines in 34 files changed: 2037 ins; 162 del; 136 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583