Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
On Thu, 16 May 2024 23:21:36 GMT, Sandhya Viswanathan wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one >> additional commit since the last revision: >> >> whitespace > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168: > >> 166: XMMRegister broadcast5 = xmm24; >> 167: KRegister limb0 = k1; >> 168: KRegister limb5 = k2; > > limb5 and select are not being used anymore. Thanks, fixed (and also broadcast5) > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185: > >> 183: __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), >> false, Assembler::AVX_512bit, rscratch); >> 184: >> 185: // A = load(*aLimbs) > > A little bit more description in comments on what the load step involves > would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, > or in the lowest limb. Done > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270: > >> 268: __ push(r14); >> 269: __ push(r15); >> 270: > > No need to save/restore rbx, r12, r14, r15. Only r13 is used as temp in > montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed > to r8. Seems I forgot to completely cleanup, thanks! (Originally copied from poly1305 stub) > src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286: > >> 284: __ mov(aLimbs, c_rarg0); >> 285: __ mov(bLimbs, c_rarg1); >> 286: __ mov(rLimbs, c_rarg2); > > We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then > these moves are not necessary. Gave them symbolic names and passed the gpr temp and parameter. vector register map still in the montgomeryMultiply function, but gprs explicitly passed in. 'close enough'? > src/hotspot/cpu/x86/vm_version_x86.cpp line 1370: > >> 1368: >> 1369: #ifdef _LP64 >> 1370: if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize >> >= 64) { > > No need to tie the intrinsic to MaxVectorSize setting. Done > src/hotspot/share/opto/library_call.cpp line 7564: > >> 7562: >> 7563: if (!stubAddr) return false; >> 7564: if (stopped()) return true; > > Line 7564 seems redundant here as there is no range check or anything like > that before this. Oh. That is what that is for... I thought it was some soft of 'VM quitting' short-circuit. Removed. - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328906 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328960 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328859 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328829 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605329040 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328995
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
On Fri, 10 May 2024 00:19:32 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > whitespace src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168: > 166: XMMRegister broadcast5 = xmm24; > 167: KRegister limb0 = k1; > 168: KRegister limb5 = k2; limb5 and select are not being used anymore. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185: > 183: __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), > false, Assembler::AVX_512bit, rscratch); > 184: > 185: // A = load(*aLimbs) A little bit more description in comments on what the load step involves would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, or in the lowest limb. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270: > 268: __ push(r14); > 269: __ push(r15); > 270: No need to save/restore rbx, r12, r14, r15. Only r13 is used as temp in montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed to r8. src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286: > 284: __ mov(aLimbs, c_rarg0); > 285: __ mov(bLimbs, c_rarg1); > 286: __ mov(rLimbs, c_rarg2); We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then these moves are not necessary. src/hotspot/cpu/x86/vm_version_x86.cpp line 1370: > 1368: > 1369: #ifdef _LP64 > 1370: if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize > >= 64) { No need to tie the intrinsic to MaxVectorSize setting. src/hotspot/share/opto/library_call.cpp line 7564: > 7562: > 7563: if (!stubAddr) return false; > 7564: if (stopped()) return true; Line 7564 seems redundant here as there is no range check or anything like that before this. - PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604169603 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604141586 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604174141 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604175443 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603792252 PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603865712
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
On Fri, 10 May 2024 00:19:32 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt ScoreError Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6443.934 ± 6.491 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6152.979 ± 4.954 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1895.410 ± 36.979 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1878.955 ± 45.487 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1357.810 ± 26.584 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1352.119 ± 23.547 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± >> 10.970 ops/s >> >> Performance, no intrinsic: >> >> Benchmark(algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.signSHA256withECDSA1024 256 >> thrpt3 6529.839 ± 42.420 ops/s >> SignatureBench.ECDSA.signSHA256withECDSA 16384 256 >> thrpt3 6199.747 ± 133.566 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA1024 256 >> thrpt3 1973.676 ± 54.071 ops/s >> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 >> thrpt3 1932.127 ± 35.920 ops/s >> Benchmark(algorithm) >> (keyLength) (kpgAlgorithm) (provider) Mode Cnt ScoreError Units >> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1355.788 ± 29.858 ops/s >> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH >> 256 EC thrpt3 1346.523 ± 28.722 ops/s >> Benchmark (isMontBench) Mode Cnt Score >> Error Units >> PolynomialP256Bench.benchMultiply true thrpt3 1919.57... > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > whitespace The changes look good and have passed testing - Marked as reviewed by ascarpino (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-2053158639
Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]
> Performance. Before: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt ScoreError Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6443.934 ± 6.491 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6152.979 ± 4.954 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1895.410 ± 36.979 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1878.955 ± 45.487 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1357.810 ± 26.584 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1352.119 ± 23.547 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply false thrpt3 1746.126 ± > 10.970 ops/s > > Performance, no intrinsic: > > Benchmark(algorithm) (dataSize) (keyLength) > (provider) Mode Cnt Score Error Units > SignatureBench.ECDSA.signSHA256withECDSA1024 256 > thrpt3 6529.839 ± 42.420 ops/s > SignatureBench.ECDSA.signSHA256withECDSA 16384 256 > thrpt3 6199.747 ± 133.566 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA1024 256 > thrpt3 1973.676 ± 54.071 ops/s > SignatureBench.ECDSA.verify SHA256withECDSA 16384 256 > thrpt3 1932.127 ± 35.920 ops/s > Benchmark(algorithm) (keyLength) > (kpgAlgorithm) (provider) Mode Cnt ScoreError Units > o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1355.788 ± 29.858 ops/s > o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH 256 > EC thrpt3 1346.523 ± 28.722 ops/s > Benchmark (isMontBench) Mode Cnt Score > Error Units > PolynomialP256Bench.benchMultiply true thrpt3 1919.574 ± > 10.591 ops/s > > Performance, **with intrinsics*... Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: whitespace - Changes: - all: https://git.openjdk.org/jdk/pull/18583/files - new: https://git.openjdk.org/jdk/pull/18583/files/83b21310..8cd095dd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk=18583=08 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=07-08 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/18583.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583 PR: https://git.openjdk.org/jdk/pull/18583