Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-17 Thread Volodymyr Paprotski
On Thu, 16 May 2024 23:21:36 GMT, Sandhya Viswanathan 
 wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   whitespace
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168:
> 
>> 166:   XMMRegister broadcast5 = xmm24;
>> 167:   KRegister limb0 = k1;
>> 168:   KRegister limb5 = k2;
> 
> limb5 and select are not being used anymore.

Thanks, fixed (and also broadcast5)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185:
> 
>> 183:   __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), 
>> false, Assembler::AVX_512bit, rscratch);
>> 184: 
>> 185:   // A = load(*aLimbs)
> 
> A little bit more description in comments on what the load step involves 
> would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, 
> or in the lowest limb.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270:
> 
>> 268:   __ push(r14);
>> 269:   __ push(r15);
>> 270: 
> 
> No need to save/restore  rbx, r12, r14, r15.  Only r13 is used as temp in 
> montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed 
> to r8.

Seems I forgot to completely cleanup, thanks! (Originally copied from poly1305 
stub)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286:
> 
>> 284:   __ mov(aLimbs, c_rarg0);
>> 285:   __ mov(bLimbs, c_rarg1);
>> 286:   __ mov(rLimbs, c_rarg2);
> 
> We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then 
> these moves are not necessary.

Gave them symbolic names and passed the gpr temp and parameter. vector register 
map still in the montgomeryMultiply function, but gprs explicitly passed in. 
'close enough'?

> src/hotspot/cpu/x86/vm_version_x86.cpp line 1370:
> 
>> 1368: 
>> 1369: #ifdef _LP64
>> 1370:   if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize 
>> >= 64) {
> 
> No need to tie the intrinsic to MaxVectorSize setting.

Done

> src/hotspot/share/opto/library_call.cpp line 7564:
> 
>> 7562: 
>> 7563:   if (!stubAddr) return false;
>> 7564:   if (stopped())  return true;
> 
> Line 7564 seems redundant here as there is no range check or anything like 
> that  before this.

Oh. That is what that is for...  I thought it was some soft of 'VM quitting' 
short-circuit. Removed.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328906
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328960
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328859
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328829
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605329040
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328995


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-16 Thread Sandhya Viswanathan
On Fri, 10 May 2024 00:19:32 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   whitespace

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168:

> 166:   XMMRegister broadcast5 = xmm24;
> 167:   KRegister limb0 = k1;
> 168:   KRegister limb5 = k2;

limb5 and select are not being used anymore.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185:

> 183:   __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), 
> false, Assembler::AVX_512bit, rscratch);
> 184: 
> 185:   // A = load(*aLimbs)

A little bit more description in comments on what the load step involves would 
be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, or in the 
lowest limb.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270:

> 268:   __ push(r14);
> 269:   __ push(r15);
> 270: 

No need to save/restore  rbx, r12, r14, r15.  Only r13 is used as temp in 
montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed to 
r8.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286:

> 284:   __ mov(aLimbs, c_rarg0);
> 285:   __ mov(bLimbs, c_rarg1);
> 286:   __ mov(rLimbs, c_rarg2);

We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then these 
moves are not necessary.

src/hotspot/cpu/x86/vm_version_x86.cpp line 1370:

> 1368: 
> 1369: #ifdef _LP64
> 1370:   if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize 
> >= 64) {

No need to tie the intrinsic to MaxVectorSize setting.

src/hotspot/share/opto/library_call.cpp line 7564:

> 7562: 
> 7563:   if (!stubAddr) return false;
> 7564:   if (stopped())  return true;

Line 7564 seems redundant here as there is no range check or anything like that 
 before this.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604169603
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604141586
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604174141
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1604175443
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603792252
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1603865712


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-13 Thread Anthony Scarpino
On Fri, 10 May 2024 00:19:32 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   whitespace

The changes look good and have passed testing

-

Marked as reviewed by ascarpino (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18583#pullrequestreview-2053158639


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-09 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/83b21310..8cd095dd

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=08
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=07-08

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583