Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v11]

Andrew Dinn Tue, 23 Jun 2026 03:21:28 -0700

On Tue, 23 Jun 2026 09:04:38 GMT, Andrew Haley <[email protected]> wrote:


> Are both of you absolutely sure you're allowing adequate time for warmup? 
> Sorry if this is a bit rude, but that variance looks really suspicious.

Yeah, it does look wrong . . .

Looking deeper into the test, it seems the problem here is that running the 
benchmark without an intrinsic just invites an apples-to-pears comparison. 
However, I did try with a longer warmup.

The benchmark test is provided in 
`test/micro/org/openjdk/bench/javax/crypto/full/PolynomialP256Bench.java` which 
specifies

```
@Warmup(iterations = 3, time = 3)
```

The throughput figures for the case where there was no intrinsic were as follows

# Warmup Iteration   1: 16360.174 ops/s
# Warmup Iteration   2: 17884.640 ops/s
# Warmup Iteration   3: 15167.575 ops/s
Iteration   1: 18568.732 ops/s
Iteration   2: 18613.381 ops/s
Iteration   3: 19689.476 ops/s
Iteration   4: 25839.174 ops/s
Iteration   5: 23548.499 ops/s
Iteration   6: 14771.007 ops/s
Iteration   7: 14746.814 ops/s
Iteration   8: 14741.437 ops/s


When I ran it again with warmup iterations set to 20 I got this


# Warmup Iteration   1: 16541.351 ops/s
# Warmup Iteration   2: 17866.159 ops/s
# Warmup Iteration   3: 15120.474 ops/s
# Warmup Iteration   4: 18522.697 ops/s
# Warmup Iteration   5: 17991.285 ops/s
# Warmup Iteration   6: 24679.450 ops/s
# Warmup Iteration   7: 17779.349 ops/s
# Warmup Iteration   8: 14321.994 ops/s
# Warmup Iteration   9: 14618.918 ops/s
# Warmup Iteration  10: 14810.129 ops/s
# Warmup Iteration  11: 14798.399 ops/s
# Warmup Iteration  12: 17931.935 ops/s
# Warmup Iteration  13: 18531.543 ops/s
# Warmup Iteration  14: 14800.131 ops/s
# Warmup Iteration  15: 15063.439 ops/s
# Warmup Iteration  16: 15064.537 ops/s
# Warmup Iteration  17: 25140.773 ops/s
# Warmup Iteration  18: 24460.440 ops/s
# Warmup Iteration  19: 21369.918 ops/s
# Warmup Iteration  20: 18584.904 ops/s
Iteration   1: 19175.972 ops/s
Iteration   2: 18969.329 ops/s
Iteration   3: 18845.385 ops/s
Iteration   4: 18518.855 ops/s
Iteration   5: 18363.983 ops/s
Iteration   6: 18085.425 ops/s
Iteration   7: 14879.353 ops/s
Iteration   8: 25664.992 ops/s


>    PolynomialP256Bench.benchAssign           true  thrpt    8  19062.912 ± 5731.302  ops/s

i.e. there is still great variability even after a lot more warmup.
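
To put a rough number on that spread, here is a minimal standalone sketch that 
recomputes the mean and sample standard deviation from the eight measurement 
iterations of the 20-warmup run. The class name `IterationSpread` is made up 
for illustration and is not part of the benchmark; the only inputs are the 
ops/s figures listed above.

```java
public class IterationSpread {
    // Measurement iterations (ops/s) copied from the 20-warmup run above.
    static final double[] OPS = {
        19175.972, 18969.329, 18845.385, 18518.855,
        18363.983, 18085.425, 14879.353, 25664.992
    };

    // Arithmetic mean of the samples.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1).
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    public static void main(String[] args) {
        double m = mean(OPS);
        double sd = sampleStdDev(OPS);
        System.out.printf("mean          = %.3f ops/s%n", m);
        System.out.printf("stddev        = %.3f ops/s%n", sd);
        System.out.printf("stddev / mean = %.1f%%%n", 100.0 * sd / m);
    }
}
```

The mean it prints should agree with the 19062.912 ops/s score in the summary 
line, and the spread is dominated by the two outlying iterations (7 and 8).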

The test code actually does this


    @Benchmark
    public MutableIntegerModuloP benchAssign() {
        MutableIntegerModuloP test1 = X.mutable();
        MutableIntegerModuloP test2 = one.mutable();
        for (int i = 0; i< 10000; i++) {
            test1.conditionalSet(test2, 0);
            test1.conditionalSet(test2, 1);
            test2.conditionalSet(test1, 0);
            test2.conditionalSet(test1, 1);
        }
        return test2;
    }

where `IntegerPolynomial::conditionalSet` is defined as

        public void conditionalSet(IntegerModuloP b, int set) {
            assert IntegerPolynomial.this == b.getField();
            Element other = (Element) b;

            conditionalAssign(set, limbs, other.limbs);
            numAdds = other.numAdds;
        }

and the intrinsic candidate `IntegerPolynomial::conditionalAssign` is defined as

    @ForceInline
    @IntrinsicCandidate
    protected static void conditionalAssign(int set, long[] a, long[] b) {
        int maskValue = -set;
        for (int i = 0; i < a.length; i++) {
            long dummyLimbs = maskValue & (a[i] ^ b[i]);
            a[i] = dummyLimbs ^ a[i];
        }
    }

With an intrinsic in place the loop body cannot really be optimized. The 
callouts to the intrinsic are opaque to the compiler so it just has to make 4 
calls per iteration.

When we disable the intrinsic and compile the Java bytecode instead, the 
compiler can immediately simplify the inlined code for the 4 conditionalSet 
calls based on the `set` argument being 0 or 1. When `set` is 0, `maskValue` 
is 0, so `dummyLimbs` is 0 and `a[i]` does not need to be updated. Maybe the 
compiler can also work out some invariants across calls or iterations, but the 
above is already enough to mean that the compiler will only be generating code 
for half the work done by the intrinsic.
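
The effect of a constant `set` can be checked outside the JVM internals with a 
small sketch. It copies the loop from the Java fallback quoted above (without 
the JDK-internal annotations) and compares it against two hand-specialized 
versions. The class name `ConstantSetSketch`, the helper names, and the limb 
values are all made up for illustration, and the specialized methods are only 
a guess at the kind of simplification a compiler is allowed to make, not a 
dump of what C2 actually emits.

```java
import java.util.Arrays;

public class ConstantSetSketch {
    // Same loop as the Java fallback quoted above, minus the JDK-internal annotations.
    static void conditionalAssign(int set, long[] a, long[] b) {
        int maskValue = -set;
        for (int i = 0; i < a.length; i++) {
            long dummyLimbs = maskValue & (a[i] ^ b[i]);
            a[i] = dummyLimbs ^ a[i];
        }
    }

    // Hand-specialized for set == 0: the mask is 0, so each store writes a[i] back unchanged.
    static void specializedForZero(long[] a, long[] b) {
        // nothing to do
    }

    // Hand-specialized for set == 1: the mask is all ones, so (a[i] ^ b[i]) ^ a[i] == b[i].
    static void specializedForOne(long[] a, long[] b) {
        System.arraycopy(b, 0, a, 0, a.length);
    }

    public static void main(String[] args) {
        long[] x = {11L, 22L, 33L, 44L, 55L};   // arbitrary limb values
        long[] one = {1L, 0L, 0L, 0L, 0L};

        for (int set = 0; set <= 1; set++) {
            long[] general = x.clone();
            long[] special = x.clone();
            conditionalAssign(set, general, one);
            if (set == 0) {
                specializedForZero(special, one);
            } else {
                specializedForOne(special, one);
            }
            System.out.println("set=" + set + " agree: " + Arrays.equals(general, special));
        }
    }
}
```

With `set` fixed at 0 the call reduces to nothing, and with `set` fixed at 1 it 
reduces to a plain copy, which is the special case the benchmark hands to the 
compiler when the intrinsic is disabled.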

That doesn't account for the high variance, but it does suggest that the 
comparison is not valuable, as it allows the compiler to win on a special case 
(`set` is known in advance).

So, @ferakocz it would still be useful to know whether disabling the intrinsic 
changes the results of the crypto tests you have run. That would be a good 
reason to consider dropping the intrinsic on aarch64 (likewise, possibly on 
x86). The micro-benchmark itself does not really offer any reason to do so.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30941#issuecomment-4778155974
