aherbert commented on PR #140:
URL: https://github.com/apache/commons-numbers/pull/140#issuecomment-1974831748
Installed the numbers-core module locally (forgot to do this at first).
MacBook Pro M2:
```
mvn -v
Apache Maven 3.9.4 (dfbb324ad4a7c8fb0bf182e6d91b0ae20e3d2dd9)
Maven home: /Users/ah403/mvn/mvn
Java version: 21.0.1, vendor: Eclipse Adoptium, runtime:
/Library/Java/JavaVirtualMachines/temurin-21.jdk/Contents/Home
Default locale: en_GB, platform encoding: UTF-8
OS name: "mac os x", version: "14.2.1", arch: "aarch64", family: "mac"
```
I also checked JDK 11 and saw the same results, so using JDK 21 is not the
reason for the difference.
When benchmarking it is recommended to avoid `final` or fixed input, as the
JVM may recognise constant data and optimise for it. So I would prefer the
benchmark without final or fixed data.
I changed the benchmark to build new data each iteration:
```java
// int benchmark state
@Setup(Level.Iteration)
public void setup() {
    values = getRandomProvider(seed).ints()
        .filter(i -> i != Integer.MIN_VALUE)
        .limit(numPairs * 2)
        .toArray();
    seed = (((long) values[0]) << Integer.SIZE) | values[1];
}

// long benchmark state
@Setup(Level.Iteration)
public void setup() {
    values = getRandomProvider(seed).longs()
        .filter(i -> i != Long.MIN_VALUE)
        .limit(numPairs * 2)
        .toArray();
    seed = values[0];
}
```
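One aside on the re-seeding line in the int variant: in Java, OR-ing an `int` into a `long` sign-extends the `int` first, so a negative low word sets all of the upper 32 bits and overwrites the high word. For a benchmark seed this is harmless, but for reference here is a hypothetical masked form of the packing (an illustrative helper, not part of the benchmark):

```java
public class PackInts {
    // Pack two ints into one long. Masking the low word with 0xFFFFFFFFL
    // prevents sign extension from clobbering the high 32 bits.
    static long pack(int hi, int lo) {
        return ((long) hi << Integer.SIZE) | (lo & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long packed = pack(42, -1);
        // The high word survives because the negative low word was masked.
        System.out.println((int) (packed >>> 32) == 42);  // true
        System.out.println((int) packed == -1);           // true
    }
}
```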
I find you get more stable timings with more than 1000 pairs. Here are
100000 pairs reseeded each time:
```
Benchmark                               (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger                100000      42  thrpt   10   21.217 ± 0.117  ops/s
GcdPerformance.gcdInt                       100000      42  thrpt   10  231.625 ± 2.014  ops/s
GcdPerformance.gcdLong                      100000      42  thrpt   10  115.497 ± 0.446  ops/s
GcdPerformance.oldGcdInt                    100000      42  thrpt   10  253.902 ± 2.531  ops/s
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  thrpt   10   68.111 ± 0.909  ops/s
GcdPerformance.oldGcdLong                   100000      42  thrpt   10   29.181 ± 0.037  ops/s
```
I see that `oldGcdInt` is faster than the new `gcdInt`. The
`oldGcdIntAdaptedForLong` is outperformed by the new `gcdLong`. This is
interesting given the trivial difference between them.
```
# average time
java -jar target/examples-jmh.jar Gcd -pnumPairs=100000 -bm avgt -tu ns
Benchmark                               (numPairs)  (seed)  Mode  Cnt         Score        Error  Units
GcdPerformance.gcdBigInteger                100000      42  avgt   10  47321374.441 ± 316079.389  ns/op
GcdPerformance.gcdInt                       100000      42  avgt   10   4365814.289 ±  34222.618  ns/op
GcdPerformance.gcdLong                      100000      42  avgt   10   8711690.051 ±  25766.036  ns/op
GcdPerformance.oldGcdInt                    100000      42  avgt   10   3945841.489 ±  35975.472  ns/op
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  avgt   10  14703364.077 ± 539905.259  ns/op
GcdPerformance.oldGcdLong                   100000      42  avgt   10  34265752.087 ± 206487.358  ns/op
```
The average-time results show the same pattern.
So on this machine it seems that the `oldGcdInt` is OK, but adapting that
method for `long` requires the change to not use `while (a != b)`.
If this is repeatable on other machines then I think the new `gcdLong`
implementation should have a comment to state why it is different from the
`gcdInt`, e.g.
```
// This method is intentionally different from the int gcd implementation.
// Benchmarking shows the test for long inequality (a != b) is slow compared to
// testing the delta for equality to zero. The same change on the int gcd
// reduces performance there, hence we have two variants of the method.
```
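To make the difference concrete, here is a minimal sketch of the two binary-gcd loop shapes being compared (simplified to non-negative inputs; these are illustrative variants, not the actual Commons Numbers code):

```java
public class GcdLoopShapes {
    // Loop shape of the old implementation: terminate on long inequality (a != b).
    static long gcdWhileNotEqual(long a, long b) {
        if (a == 0) return b;
        if (b == 0) return a;
        final int shift = Long.numberOfTrailingZeros(a | b);
        // Remove factors of two; both values are odd inside the loop.
        a >>= Long.numberOfTrailingZeros(a);
        b >>= Long.numberOfTrailingZeros(b);
        while (a != b) {  // the long inequality test under discussion
            if (a > b) {
                a -= b;
                a >>= Long.numberOfTrailingZeros(a);
            } else {
                b -= a;
                b >>= Long.numberOfTrailingZeros(b);
            }
        }
        return a << shift;
    }

    // Loop shape of the new implementation: terminate when the delta is zero.
    static long gcdWhileDeltaNotZero(long a, long b) {
        if (a == 0) return b;
        if (b == 0) return a;
        final int shift = Long.numberOfTrailingZeros(a | b);
        a >>= Long.numberOfTrailingZeros(a);
        b >>= Long.numberOfTrailingZeros(b);
        long delta;
        while ((delta = a - b) != 0) {  // compare the delta to zero instead
            b = Math.min(a, b);
            a = Math.abs(delta);
            a >>= Long.numberOfTrailingZeros(a);  // delta of two odd values is even
        }
        return a << shift;
    }

    public static void main(String[] args) {
        // Both variants agree with BigInteger.gcd on sample pairs.
        long[][] pairs = { {12, 18}, {0, 7}, {35, 49}, {1L << 20, 3} };
        for (long[] p : pairs) {
            long expected = java.math.BigInteger.valueOf(p[0])
                .gcd(java.math.BigInteger.valueOf(p[1])).longValueExact();
            System.out.println(GcdLoopShapes.gcdWhileNotEqual(p[0], p[1]) == expected
                && GcdLoopShapes.gcdWhileDeltaNotZero(p[0], p[1]) == expected);
        }
    }
}
```

The two methods compute the same result by the same subtract-and-shift scheme; only the loop-exit comparison differs, which is what makes the benchmark gap notable.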
I thought it might be because the loop executes more often for 64-bit numbers
and the change aided pipelining, as the delta is required more often. So I
recompiled with the long data limited to 32-bit (int) values. However, the
difference is still there:
```
Benchmark                               (numPairs)  (seed)   Mode  Cnt    Score   Error  Units
GcdPerformance.gcdBigInteger                100000      42  thrpt   10   96.059 ± 2.658  ops/s
GcdPerformance.gcdInt                       100000      42  thrpt   10  230.165 ± 0.501  ops/s
GcdPerformance.gcdLong                      100000      42  thrpt   10  218.651 ± 0.516  ops/s
GcdPerformance.oldGcdInt                    100000      42  thrpt   10  252.595 ± 0.822  ops/s
GcdPerformance.oldGcdIntAdaptedForLong      100000      42  thrpt   10  131.465 ± 0.451  ops/s
GcdPerformance.oldGcdLong                   100000      42  thrpt   10   58.857 ± 0.123  ops/s
```
Note that int-sized data makes `BigInteger` about 4x faster (it has a smaller
representation in memory) and the long implementations about 2x faster, so the
loop execution count is approximately halved. The new long gcd version
approaches the speed of the int versions.
Since the speed difference is still there, it may be that equality comparison
between longs is slower than comparison to zero, but I did not investigate
further. Note that to test whether a long is non-zero you can fold the upper
and lower 32 bits into a single int and test that it is non-zero. So perhaps
comparison of a long to zero is faster due to such an optimisation, but I'm
not familiar with what goes on at the hardware level.
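The bit-folding test mentioned above can be written as follows (an illustrative snippet of the idea, not something taken from the JDK or this PR):

```java
public class LongNonZero {
    // Fold the upper and lower 32 bits into one int; the OR is non-zero
    // if and only if the original long is non-zero.
    static boolean isNonZero(long x) {
        return ((int) x | (int) (x >>> 32)) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isNonZero(0L));              // false
        System.out.println(isNonZero(1L));              // true
        System.out.println(isNonZero(1L << 40));        // true: only the high word is set
        System.out.println(isNonZero(Long.MIN_VALUE));  // true
    }
}
```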
I will repeat the benchmark on another machine when I go back to work on
Monday. I find the M2 processor can often fail to show differences that are
seen on my old Xeon workstation processor, i.e. make sure to test on the worst
processor(s) you have access to.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]