Integrated: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

2024-05-22 Thread Volodymyr Paprotski
On Tue, 2 Apr 2024 15:42:05 GMT, Volodymyr Paprotski  wrote:

> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

This pull request has now been integrated.

Changeset: afed7d0b
Author:Volodymyr Paprotski 
Committer: Sandhya Viswanathan 
URL:   
https://git.openjdk.org/jdk/commit/afed7d0b0593864e5595840a6b645c210ff28c7c
Stats: 2409 lines in 36 files changed: 2093 ins; 156 del; 160 mod

8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

Reviewed-by: ihse, ascarpino, sviswanathan

-

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-05-22 Thread Volodymyr Paprotski
On Tue, 21 May 2024 17:41:46 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request with a new target base due 
> to a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 17 additional 
> commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/master' into ecc-montgomery
>  - shenandoah verifier
>  - comments from Sandhya
>  - whitespace
>  - add message back
>  - whitespace
>  - Use AffinePoint to exit Montgomery domain
>
>Style notes:
>Affine.equals()
>- Mismatched fields only appear to be used from testing, perhaps 
> should be moved there instead
>Affine.getX(boolean)|getY(boolean)
>- "Passing flag is bad design" - cleanest/performant alternative to 
> several instanceof checks
>- needed to convert Affine to Projective (need to stay in montgomery 
> domain)
>ECOperations.PointMultiplier
>   - changes could probably be restored to original (since ProjectivePoint 
> handling no longer required)
>   - consider these changes an improvement? (fewer nested classes)
>   - was an inner-class but not using inner-class features (i.e. ecOps 
> variable should be converted)
>  - whitespace
>  - Comments from Tony and Jatin
>  - Comments from Jatin and Tony
>  - ... and 7 more: https://git.openjdk.org/jdk/compare/c0032e2c...b1a33004

Thanks Tobi!

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2124924526


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v12]

2024-05-21 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request with a new target base due to 
a merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains 17 additional commits 
since the last revision:

 - Merge remote-tracking branch 'origin/master' into ecc-montgomery
 - shenandoah verifier
 - comments from Sandhya
 - whitespace
 - add message back
 - whitespace
 - Use AffinePoint to exit Montgomery domain
   
   Style notes:
   Affine.equals()
   - Mismatched fields only appear to be used from testing, perhaps should 
be moved there instead
   Affine.getX(boolean)|getY(boolean)
   - "Passing flag is bad design" - cleanest/performant alternative to 
several instanceof checks
   - needed to convert Affine to Projective (need to stay in montgomery 
domain)
   ECOperations.PointMultiplier
  - changes could probably be restored to original (since ProjectivePoint 
handling no longer required)
  - consider these changes an improvement? (fewer nested classes)
  - was an inner-class but not using inner-class features (i.e. ecOps 
variable should be converted)
 - whitespace
 - Comments from Tony and Jatin
 - Comments from Jatin and Tony
 - ... and 7 more: https://git.openjdk.org/jdk/compare/12e8009b...b1a33004

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/df4fe6fa..b1a33004

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=11
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=10-11

  Stats: 190975 lines in 3949 files changed: 105304 ins; 64688 del; 20983 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]

2024-05-21 Thread Volodymyr Paprotski
On Tue, 21 May 2024 07:21:14 GMT, Tobias Hartmann  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   shenandoah verifier
>
> I'm getting some conflicts when trying to apply this to master. Could you 
> please merge the PR?

Hi @TobiHartmann , merged with no issues for me. Could you please run the tests 
again? (I think Tony did run them, but can't hurt verifying again). Thanks!

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2123122468


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]

2024-05-17 Thread Volodymyr Paprotski
On Fri, 17 May 2024 21:16:47 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   shenandoah verifier

Thanks Sandhya!

Now that I have @ascarpino approval as well, I plan to integrate next Tuesday.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2118443577


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v11]

2024-05-17 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  shenandoah verifier

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/5c360e35..df4fe6fa

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=10
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=09-10

  Stats: 7 lines in 2 files changed: 6 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-17 Thread Volodymyr Paprotski
On Thu, 16 May 2024 23:21:36 GMT, Sandhya Viswanathan 
 wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   whitespace
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 168:
> 
>> 166:   XMMRegister broadcast5 = xmm24;
>> 167:   KRegister limb0 = k1;
>> 168:   KRegister limb5 = k2;
> 
> limb5 and select are not being used anymore.

Thanks, fixed (and also broadcast5)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 185:
> 
>> 183:   __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), 
>> false, Assembler::AVX_512bit, rscratch);
>> 184: 
>> 185:   // A = load(*aLimbs)
> 
> A little bit more description in comments on what the load step involves 
> would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, 
> or in the lowest limb.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 270:
> 
>> 268:   __ push(r14);
>> 269:   __ push(r15);
>> 270: 
> 
> No need to save/restore  rbx, r12, r14, r15.  Only r13 is used as temp in 
> montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed 
> to r8.

Seems I forgot to completely cleanup, thanks! (Originally copied from poly1305 
stub)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 286:
> 
>> 284:   __ mov(aLimbs, c_rarg0);
>> 285:   __ mov(bLimbs, c_rarg1);
>> 286:   __ mov(rLimbs, c_rarg2);
> 
> We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then 
> these moves are not necessary.

Gave them symbolic names and passed the gpr temp and parameter. vector register 
map still in the montgomeryMultiply function, but gprs explicitly passed in. 
'close enough'?

> src/hotspot/cpu/x86/vm_version_x86.cpp line 1370:
> 
>> 1368: 
>> 1369: #ifdef _LP64
>> 1370:   if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize 
>> >= 64) {
> 
> No need to tie the intrinsic to MaxVectorSize setting.

Done

> src/hotspot/share/opto/library_call.cpp line 7564:
> 
>> 7562: 
>> 7563:   if (!stubAddr) return false;
>> 7564:   if (stopped())  return true;
> 
> Line 7564 seems redundant here as there is no range check or anything like 
> that  before this.

Oh. That is what that is for...  I thought it was some soft of 'VM quitting' 
short-circuit. Removed.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328906
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328960
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328859
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328829
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605329040
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1605328995


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v10]

2024-05-17 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  comments from Sandhya

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/8cd095dd..5c360e35

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=09
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=08-09

  Stats: 82 lines in 4 files changed: 1 ins; 59 del; 22 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v9]

2024-05-09 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/83b21310..8cd095dd

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=08
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=07-08

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v7]

2024-05-09 Thread Volodymyr Paprotski
On Thu, 9 May 2024 23:36:03 GMT, Anthony Scarpino  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   whitespace
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 701:
> 
>> 699: if (!m.equals(v)) {
>> 700: java.util.HexFormat hex = 
>> java.util.HexFormat.of();
>> 701: throw new RuntimeException();
> 
> I think your cleanup went to far. You should have some message saying they 
> are not equal and if you don't want to print hex, remove getting an instance.

I put the message back.. I removed it 'half'-intentionally; Was comparing 
against the original version and it didn't have  any details, thought maybe 
should follow suit. But I did find this message helpful, so its back.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1596116606


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v8]

2024-05-09 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  add message back

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/1ecfdc44..83b21310

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=07
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=06-07

  Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v7]

2024-05-09 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/8ff243a2..1ecfdc44

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=06
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=05-06

  Stats: 753 lines in 9 files changed: 303 ins; 101 del; 349 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v6]

2024-05-06 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  Use AffinePoint to exit Montgomery domain
  
  Style notes:
  Affine.equals()
  - Mismatched fields only appear to be used from testing, perhaps should 
be moved there instead
  Affine.getX(boolean)|getY(boolean)
  - "Passing flag is bad design" - cleanest/performant alternative to 
several instanceof checks
  - needed to convert Affine to Projective (need to stay in montgomery 
domain)
  ECOperations.PointMultiplier
 - changes could probably be restored to original (since ProjectivePoint 
handling no longer required)
 - consider these changes an improvement? (fewer nested classes)
 - was an inner-class but not using inner-class features (i.e. ecOps 
variable should be converted)

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/a1984501..8ff243a2

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=05
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=04-05

  Stats: 268 lines in 7 files changed: 89 ins; 147 del; 32 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v5]

2024-04-25 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/c93a71f0..a1984501

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=04
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=03-04

  Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v3]

2024-04-24 Thread Volodymyr Paprotski
On Tue, 23 Apr 2024 19:55:57 GMT, Anthony Scarpino  
wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Comments from Jatin and Tony
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 204:
> 
>> 202:  * @return the product
>> 203:  */
>> 204: public MutablePoint multiply(AffinePoint affineP, byte[] s) {
> 
> It seems like there could be some combining of both `multiply()`.  If 
> `multiply(AffinePoint, ...)` is called, it can call `DefaultMultiplier` with 
> the `affineP`, but internally call the other `multiply(ECPoint, ...)` for the 
> other situations.  I'd rather not have two methods doing most of the same 
> code, but different methods.

Thanks, they indeed look identical, didnt notice. Fixed. (repeated the same 
hashmap refactoring and didnt notice I produced identical code twice)

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 467:
> 
>> 465: sealed static abstract class SmallWindowMultiplier implements 
>> PointMultiplier
>> 466: permits DefaultMultiplier, DefaultMontgomeryMultiplier {
>> 467: private final AffinePoint affineP;
> 
> I don't think `affineP` needs to be a class variable anymore.  It's only used 
> in the constructor

Didn't notice, thanks, fixed.

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 592:
> 
>> 590: }
>> 591: 
>> 592: private final ProjectivePoint.Immutable[][] points;
> 
> Can you define this at the top please.

Done

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 668:
> 
>> 666: }
>> 667: 
>> 668: private final BigInteger[] base;
> 
> Can you define this at the top.  You use it in the constructor but it's 
> defined later on.

Done

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578117929
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578147190
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578148562
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578150303


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-24 Thread Volodymyr Paprotski
On Tue, 9 Apr 2024 02:01:36 GMT, Anthony Scarpino  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/java.base/share/classes/sun/security/ec/ECOperations.java line 308:
> 
>> 306: 
>> 307: /*
>> 308:  * public Point addition. Used by ECDSAOperations
> 
> Was the old description not applicable anymore?   It would be nice to improve 
> on the existing description that shortening it.

Forgot to go back and fix the comment. Fixed..

As for the 'meaning'. Notice the signature of the function changed (i.e. no 
longer a 'mixed point', but two ProjectivePoints. This is a good idea 
regardless of Montgomery, but it affects montgomery particularly badly (need to 
compute zInv for 'no reason'. )

For sake of completeness. Apart from constructor, the 'API' for ECOperations 
(i.e. as used by ECDHE, ECDSAOperations and KeyGeneration) are these three 
functions (everything else is used internally by this class)

public void setSum(MutablePoint p, MutablePoint p2)
public MutablePoint multiply(AffinePoint affineP, byte[] s)
public MutablePoint multiply(ECPoint ecPoint, byte[] s)

> src/java.base/share/classes/sun/security/ec/ECOperations.java line 321:
> 
>> 319: ECOperations ops = this;
>> 320: if (this.montgomeryOps != null) {
>> 321: assert p.getField() instanceof 
>> IntegerMontgomeryFieldModuloP;
> 
> This should throw a ProviderException, I believe this would throw an 
> AssertionException

Missed this comment. No longer applicable (this.montgomeryOps got refactored 
away)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578144125
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578161140


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v4]

2024-04-24 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  Comments from Tony and Jatin

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/6f9ac046..c93a71f0

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=03
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=02-03

  Stats: 48 lines in 2 files changed: 20 ins; 20 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-24 Thread Volodymyr Paprotski
On Tue, 16 Apr 2024 02:26:57 GMT, Jatin Bhateja  wrote:

>> Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add 
>> alignment and loop rotation, but being a fallback figured its more important 
>> to keep it small
>
> It's all part of intrinsic, no harm in polishing it.

Done (normalized loop/backedge). There was actually a problem in the loop 
counter.. (`i-=1` instead of `i-=16`). Can't include a test since classes are 
sealed, but verified manually.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1578172873


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski
On Fri, 5 Apr 2024 07:19:28 GMT, Jatin Bhateja  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 39:
> 
>> 37: };
>> 38: static address modulus_p256() {
>> 39:   return (address)MODULUS_P256;
> 
> Long constants should have UL suffix.

Properly ULL, but good point, fixed

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 386:
> 
>> 384:   __ jcc(Assembler::equal, L_Length19);
>> 385: 
>> 386:   // Default copy loop
> 
> Please add appropriate loop entry alignment.

This is actually a 'switch statement default'. The default should never happen 
(See "Known Length comment on line 335"), but added because java code has that 
behavior. (i.e. in the unlikely case NIST adds a new elliptic curve to the 
existing standard?)

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394:
> 
>> 392:   __ lea(aLimbs, Address(aLimbs,8));
>> 393:   __ lea(bLimbs, Address(bLimbs,8));
>> 394:   __ jmp(L_DefaultLoop);
> 
> Both sub and cmp are flag affecting instructions and are macro-fusible. 
> By doing a loop rotation i.e. moving the length <= 0 check outside the loop 
> and pushing the loop exit check at bottom you can save additional compare 
> checks.

Per-above, this is a switch statement (`UNLIKELY`) fallback. I can still add 
alignment and loop rotation, but being a fallback figured its more important to 
keep it small

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486768
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486717
PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1566486848


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski
On Thu, 11 Apr 2024 17:15:21 GMT, Anthony Scarpino  
wrote:

>>> In `ECOperations.java`, if I understand this correctly, it is to replace 
>>> the existing `PointMultiplier` with montgomery-based PointMuliplier. But 
>>> when I look at the code, I see both are still options. If I read this 
>>> correctly, it checks for the old `IntegerFieldModuloP`, then looks for the 
>>> new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. 
>>> Why doesn't it just replace the old implementation entry in the `fields` 
>>> Map? Is there a reason to keep it around?
>> 
>> Hmm, thats a good point I haven't fully considered; i.e. (if I read 
>> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery 
>> entirely".. that might also help in cleaning a few things up in the 
>> construction. Maybe even get rid of this nested ECOperations inside 
>> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the 
>> ECC stack clearer is positive!
>> 
>> One functional reason that might justify keeping it as-is, is fuzz-testing; 
>> with the fallback available, I am able to write the included Fuzz tests and 
>> have them check the values against the existing implementation. While I also 
>> included a few KAT tests using openssl-generated values, the fuzz tests 
>> check millions of values and it does add a lot more certainty about 
>> correctness of this code.
>> 
>> Can it be removed? For the operations that do not involve multiplication 
>> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the 
>> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and 
>> existing IntegerPolynomialP256 is no longer used (I should verify that 
>> again) and only P256OrderField remains non-montgomery. So removing 
>> references to IntegerPolynomialP256 in ECOperations should be possible and 
>> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 
>> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also 
>> worth some thought..
>> 
>> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but 
>> it could also be chucked up as part of 'scaffolding' and removed in name of 
>> code quality?
>> 
>> Thanks @ascarpino 
>> 
>> PS: Perhaps there is some middle ground, remove the `ECOperations 
>> montgomeryOps` nesting, and construct (somehow?? singleton makes most things 
>> inaccessible..) the reference ECOperations in the fuzz test instead.. not 
>> sure how yet, but perhaps worth a further thought..
>
>> > In `ECOperations.java`, if I understand this correctly, it is to replace 
>> > the existing `PointMultiplier` with montgomery-based PointMuliplier. But 
>> > when I look at the code, I see both are still options. If I read this 
>> > correctly, it checks for the old `IntegerFieldModuloP`, then looks for the 
>> > new `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. 
>> > Why doesn't it just replace the old implementation entry in the `fields` 
>> > Map? Is there a reason to keep it around?
>> 
>> Hmm, thats a good point I haven't fully considered; i.e. (if I read 
>> correctly) "for `CurveDB.P_256` remove the fallback path to non-montgomery 
>> entirely".. that might also help in cleaning a few things up in the 
>> construction. Maybe even get rid of this nested ECOperations inside 
>> ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the 
>> ECC stack clearer is positive!
>> 
>> One functional reason that might justify keeping it as-is, is fuzz-testing; 
>> with the fallback available, I am able to write the included Fuzz tests and 
>> have them check the values against the existing implementation. While I also 
>> included a few KAT tests using openssl-generated values, the fuzz tests 
>> check millions of values and it does add a lot more certainty about 
>> correctness of this code.
> 
> I hadn't looked at your fuzz test until you mentioned it.  I see you are 
> using reflection to change the values.  Is that what you mean by "fallback"?  
> I'm assuming there is no to access the older implementation without 
> reflection.
> 
>> 
>> Can it be removed? For the operations that do not involve multiplication 
>> (i.e. `setSum(*)`), montgomery is expensive. I think I did go through the 
>> uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and 
>> existing IntegerPolynomialP256 is no longer used (I should verify that 
>> again) and only P256OrderField remains non-montgomery. So removing 
>> references to IntegerPolynomialP256 in ECOperations should be possible and 
>> cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 
>> is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also 
>> worth some thought..
>> 
>> I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but 
>> it could also be chucked up as part of 'scaffolding' and removed in name of 
>> code quality?
> 
> I 

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Volodymyr Paprotski
On Wed, 10 Apr 2024 23:56:52 GMT, Volodymyr Paprotski  wrote:

> Few early comments.
> 
> Please update the copyright year of all the modified files.
> 
> You can even consider splitting this into two patches, Java side changes in 
> one and x86 optimized intrinsic in next one.

Fixed all copyright years

git diff da8a095a19c90e7ee2b45fab9b533a1092887023 | lsdiff -p1 | while read 
line; do echo $line =; grep Copyright $line | grep -v 
2024; done | less

Re splitting.. probably too late for that now.. (did consider it initially.. 
got hard to manage two changes while developing. And easier to justify the 
change when the entire patch is presented? but yes, far more code to review.. )

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2057892691


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v3]

2024-04-15 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  Comments from Jatin and Tony

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/82b6dae7..6f9ac046

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=02
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=01-02

  Stats: 97 lines in 20 files changed: 4 ins; 36 del; 57 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-10 Thread Volodymyr Paprotski
On Fri, 5 Apr 2024 09:17:18 GMT, Jatin Bhateja  wrote:

> Few early comments.
> 
> Please update the copyright year of all the modified files.
> 
> You can even consider splitting this into two patches, Java side changes in 
> one and x86 optimized intrinsic in next one.

Thanks Jatin, will fix!

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048618452


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-10 Thread Volodymyr Paprotski
On Wed, 10 Apr 2024 17:18:55 GMT, Anthony Scarpino  
wrote:

> In `ECOperations.java`, if I understand this correctly, it is to replace the 
> existing `PointMultiplier` with montgomery-based PointMuliplier. But when I 
> look at the code, I see both are still options. If I read this correctly, it 
> checks for the old `IntegerFieldModuloP`, then looks for the new 
> `IntegerMontgomeryFieldModuloP`. It appears to use the new one always. Why 
> doesn't it just replace the old implementation entry in the `fields` Map? Is 
> there a reason to keep it around?

Hmm, thats a good point I haven't fully considered; i.e. (if I read correctly) 
"for `CurveDB.P_256` remove the fallback path to non-montgomery entirely".. 
that might also help in cleaning a few things up in the construction. Maybe 
even get rid of this nested ECOperations inside ECOperations.. Perhaps nesting 
isnt a big deal, but all attempts to make the ECC stack clearer is positive!

One functional reason that might justify keeping it as-is, is fuzz-testing; 
with the fallback available, I am able to write the included Fuzz tests and 
have them check the values against the existing implementation. While I also 
included a few KAT tests using openssl-generated values, the fuzz tests check 
millions of values and it does add a lot more certainty about correctness of 
this code.

Can it be removed? For the operations that do not involve multiplication (i.e. 
`setSum(*)`), montgomery is expensive. I think I did go through the uses of 
this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and existing 
IntegerPolynomialP256 is no longer used (I should verify that again) and only 
P256OrderField remains non-montgomery. So removing references to 
IntegerPolynomialP256 in ECOperations should be possible and cleaner. Removing 
IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder 
(fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some 
thought..

I tend to like `ECOperationsFuzzTest.java` and would prefer to keep it, but it 
could also be chucked up as part of 'scaffolding' and removed in name of code 
quality?

Thanks @ascarpino 

PS: Perhaps there is some middle ground, remove the `ECOperations 
montgomeryOps` nesting, and construct (somehow?? singleton makes most things 
inaccessible..) the reference ECOperations in the fuzz test instead.. not sure 
how yet, but perhaps worth a further thought..

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2048159645


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-05 Thread Volodymyr Paprotski
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski  wrote:

>> Performance. Before:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt ScoreError  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6443.934 ±  6.491  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6152.979 ±  4.954  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1895.410 ± 36.979  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1878.955 ± 45.487  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1357.810 ± 26.584  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1352.119 ± 23.547  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
>> 10.970  ops/s
>> 
>> Performance, no intrinsic:
>> 
>> Benchmark(algorithm)  (dataSize)  (keyLength)  
>> (provider)   Mode  Cnt Score Error  Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024  256
>>   thrpt3  6529.839 ±  42.420  ops/s
>> SignatureBench.ECDSA.signSHA256withECDSA   16384  256
>>   thrpt3  6199.747 ± 133.566  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256
>>   thrpt3  1973.676 ±  54.071  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256
>>   thrpt3  1932.127 ±  35.920  ops/s
>> Benchmark(algorithm)  
>> (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
>> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  
>> 256  EC  thrpt3  1355.788 ± 29.858  ops/s
>> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  
>> 256  EC  thrpt3  1346.523 ± 28.722  ops/s
>> Benchmark  (isMontBench)   Mode  Cnt Score
>> Error  Units
>> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.57...
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   remove use of jdk.crypto.ec

@ascarpino Hi Tony, this is the ECC P256 PR we talked about last year, would 
appreciate your feedback.

-

PR Comment: https://git.openjdk.org/jdk/pull/18583#issuecomment-2040325424


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-02 Thread Volodymyr Paprotski
On Tue, 2 Apr 2024 16:29:07 GMT, Alan Bateman  wrote:

>> Volodymyr Paprotski has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   remove use of jdk.crypto.ec
>
> src/java.base/share/classes/module-info.java line 265:
> 
>> 263: jdk.jfr,
>> 264: jdk.unsupported,
>> 265: jdk.crypto.ec;
> 
> jdk.crypto.ec has been hollowed out since JDK 22, the sun.security.ec are in 
> java.base. So I don't think you need this qualified export.

Thanks, fixed. (Started this when `jdk.crypto.ec` was still in use.. missed a 
few spots during rebase I guess)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18583#discussion_r1548460157


Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-02 Thread Volodymyr Paprotski
> Performance. Before:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt ScoreError  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6443.934 ±  6.491  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6152.979 ±  4.954  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1895.410 ± 36.979  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1878.955 ± 45.487  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1357.810 ± 26.584  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1352.119 ± 23.547  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 
> 10.970  ops/s
> 
> Performance, no intrinsic:
> 
> Benchmark(algorithm)  (dataSize)  (keyLength)  
> (provider)   Mode  Cnt Score Error  Units
> SignatureBench.ECDSA.signSHA256withECDSA1024  256 
>  thrpt3  6529.839 ±  42.420  ops/s
> SignatureBench.ECDSA.signSHA256withECDSA   16384  256 
>  thrpt3  6199.747 ± 133.566  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA1024  256 
>  thrpt3  1973.676 ±  54.071  ops/s
> SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256 
>  thrpt3  1932.127 ±  35.920  ops/s
> Benchmark(algorithm)  (keyLength) 
>  (kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
> o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256 
>  EC  thrpt3  1355.788 ± 29.858  ops/s
> o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256 
>  EC  thrpt3  1346.523 ± 28.722  ops/s
> Benchmark  (isMontBench)   Mode  Cnt Score
> Error  Units
> PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 
> 10.591  ops/s
> 
> Performance, **with intrinsics*...

Volodymyr Paprotski has updated the pull request incrementally with one 
additional commit since the last revision:

  remove use of jdk.crypto.ec

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/18583/files
  - new: https://git.openjdk.org/jdk/pull/18583/files/dbe6cd3b..82b6dae7

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=18583=01
 - incr: https://webrevs.openjdk.org/?repo=jdk=18583=00-01

  Stats: 6 lines in 2 files changed: 0 ins; 1 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583


RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

2024-04-02 Thread Volodymyr Paprotski
Performance. Before:

Benchmark(algorithm)  (dataSize)  (keyLength)  
(provider)   Mode  Cnt ScoreError  Units
SignatureBench.ECDSA.signSHA256withECDSA1024  256   
   thrpt3  6443.934 ±  6.491  ops/s
SignatureBench.ECDSA.signSHA256withECDSA   16384  256   
   thrpt3  6152.979 ±  4.954  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA1024  256   
   thrpt3  1895.410 ± 36.979  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256   
   thrpt3  1878.955 ± 45.487  ops/s
Benchmark(algorithm)  (keyLength)  
(kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256   
   EC  thrpt3  1357.810 ± 26.584  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256   
   EC  thrpt3  1352.119 ± 23.547  ops/s
Benchmark  (isMontBench)   Mode  Cnt ScoreError 
 Units
PolynomialP256Bench.benchMultiply  false  thrpt3  1746.126 ± 10.970 
 ops/s

Performance, no intrinsic:

Benchmark(algorithm)  (dataSize)  (keyLength)  
(provider)   Mode  Cnt Score Error  Units
SignatureBench.ECDSA.signSHA256withECDSA1024  256   
   thrpt3  6529.839 ±  42.420  ops/s
SignatureBench.ECDSA.signSHA256withECDSA   16384  256   
   thrpt3  6199.747 ± 133.566  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA1024  256   
   thrpt3  1973.676 ±  54.071  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256   
   thrpt3  1932.127 ±  35.920  ops/s
Benchmark(algorithm)  (keyLength)  
(kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256   
   EC  thrpt3  1355.788 ± 29.858  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256   
   EC  thrpt3  1346.523 ± 28.722  ops/s
Benchmark  (isMontBench)   Mode  Cnt ScoreError 
 Units
PolynomialP256Bench.benchMultiply   true  thrpt3  1919.574 ± 10.591 
 ops/s

Performance, **with intrinsics**

Benchmark(algorithm)  (dataSize)  (keyLength)  
(provider)   Mode  Cnt  Score Error  Units
SignatureBench.ECDSA.signSHA256withECDSA1024  256   
   thrpt3  10384.591 ±  65.274  ops/s
SignatureBench.ECDSA.signSHA256withECDSA   16384  256   
   thrpt3   9592.912 ± 236.411  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA1024  256   
   thrpt3   3479.494 ±  44.578  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA   16384  256   
   thrpt3   3402.147 ±  26.772  ops/s
Benchmark(algorithm)  (keyLength)  
(kpgAlgorithm)  (provider)   Mode  Cnt ScoreError  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret  ECDH  256   
   EC  thrpt3  2527.678 ± 64.791  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret ECDH  256   
   EC  thrpt3  2541.258 ± 66.634  ops/s
Benchmark  (isMontBench)   Mode  Cnt ScoreError 
 Units
PolynomialP256Bench.benchMultiply   true  thrpt3  3021.139 ± 98.289 
 ops/s


Summary on design (see code for 'ASCII art', references and details on math):
- Added a new `IntegerPolynomial` field (`MontgomeryIntegerPolynomialP256`) 
with 52-bit limbs
   - `getElement(*)/fromMontgomery()` to convert numbers into/out of the field
 - `ECOperations` is the primary use of the new field
   - flattened some extra deep nested class hierarchy (also in prep for further 
other field optimizations)
   - `forParameters()/multiply()/setSum()` generates numbers in the new field
 - `ProjectivePoint/Montgomery{Imm|M}utable.asAffine()` to convert out of the 
new field
 - Added Fuzz Testing and KAT verified with OpenSSL

-

Commit messages:
 - remove trailing whitespace
 - Remeasure performance
 - Fix rebase typo
 - Address comments from Anas and thorough cleanup
 - conditionalAssign intrinsic
 - rebase

Changes: https://git.openjdk.org/jdk/pull/18583/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk=18583=00
  Issue: https://bugs.openjdk.org/browse/JDK-8329538
  Stats: 2335 lines in 34 files changed: 2037 ins; 162 del; 136 mod
  Patch: https://git.openjdk.org/jdk/pull/18583.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583

PR: https://git.openjdk.org/jdk/pull/18583