[
https://issues.apache.org/jira/browse/RNG-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061378#comment-18061378
]
Alex Herbert commented on RNG-188:
----------------------------------
The Philox 4x64 generator uses 64-bit multiplication to create a 128-bit
result. This has potential for a performance increase using higher JVM versions.
h2. Unsigned multiply high
The default implementation in Commons RNG uses a composed addition of products
to compute the upper 64-bits. The java.lang.Math class has methods to compute
the upper 64-bit result with potential intrinsic calls to native functions:
||Method||JDK||Notes||
|[multiplyHigh (Javadoc JDK
21)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#multiplyHigh(long,long)]|9|Convert
to unsigned using:
Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a)|
|[unsignedMultiplyHigh (Javadoc JDK
21)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#unsignedMultiplyHigh(long,long)]|18|
|
Since Commons RNG targets Java 8 these methods cannot be used. However they can
be used with a
[MethodHandle|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/invoke/MethodHandle.html]
introduced in Java 7:
{code:java}
// find the method
MethodHandle h = MethodHandles.lookup()
.findStatic(Math.class,
"unsignedMultiplyHigh",
MethodType.methodType(long.class, long.class, long.class));
// invoke (snippet)
long a, b;
try {
long r = (long) h2.invokeExact(a, b);
} catch (Throwable e) {
throw new RuntimeException(e);
}
{code}
I added the use of method handles to the existing LXMBenchmark for unsigned
multiply.
||Method||Notes||
|baseline2|Multiplication of 2 long values.|
|unsignedMultiplyHigh|Composed addition of products.|
|mhMultiplyHigh|Uses a method handle to call either Math.multiplyHigh and
convert the result to unsigned; or defaults to the composed addition of
products.|
|mhUnsignedMultiplyHigh|Uses a method handle to call either
Math.unsignedMultiplyHigh; or defaults to the composed addition of products.|
The method handle is obtained once during class initialisation and invocation
is stored as a static final reference to a primitive typed LongBinaryOperator.
This avoids the use of boxing arguments when the method is not available, i.e.
the default call to the composed addition does not use a method handle. All
calls to compute a result invoke the operator.
Results from an M2 Max processor using various OpenJDK versions. Times are in
ns/operation:
|Method|8|11|21|
|baseline2|4.203|3.036|1.978|
|unsignedMultiplyHigh|4.475|3.211|2.228|
|mhMultiplyHigh|4.514|3.049|2.081|
|mhUnsignedMultiplyHigh|4.561|3.194|1.975|
On Java 8 all three methods should be calling the same implementation. The
performance is within error.
On Java 11 the mhMultiplyHigh method can use the JDK multiplyHigh
implementation and performance improves.
On Java 21 the mhUnsignedMultiplyHigh method increases performance to the speed
of a native long multiplication.
Note that this benchmark has an overhead of generation of random long values to
multiply. These results used 1024 precomputed long values with stride indexing
into the array; different strides are used for argument a and b allowing 1024^2
possible products. The benchmark overhead (baseline2) is most of the time for
the benchmark so a real world performance test was made using the Philox
generator.
h2. Philox4x64
I added a benchmark to test using the method handle from a Philox4x64 generator:
||RNG||Notes||
|PHILOX_4X64|The current v1.7-SNAPSHOT version of the Philox4x64 generator (for
reference).|
|PHILOX_4X64_ORIGINAL|A copy of the original implementation of the Philox4x64
generator, modified to make members protected to allow override.|
|PHILOX_4X64_MH|Overridden generation of the output sequence using the method
handle to Math.multiplyHigh.|
|PHILOX_4X64_UMH|Overridden generation of the output sequence using the method
handle to Math.unsignedMultiplyHigh.|
Results are provided across LTS JDKs using the OpenJDK implementation
(1.8.0_472; 11.0.29; 17.0.17; 21.0.9; 25.0.2):
|RNG|8|11|17|21|25|
|PHILOX_4X64|17.63|11.03|9.495|9.452|9.536|
|PHILOX_4X64_ORIGINAL|17.49|11.15|9.491|9.454|9.57|
|PHILOX_4X64_MH|17.41|6.641|4.172|4.026|4.036|
|PHILOX_4X64_UMH|17.58|11.07|9.493|2.624|2.588|
The performance of the default implementation improves from 8 to 11 to 17 then
plateaus.
The Philox implementation and its copy have identical results across JDKs (as
expected).
On JDK 8 all versions are a similar speed.
On JDK 11 and 17 the use of the Math.multiplyHigh doubles the speed.
On JDK 21 and 25 the use of Math.unsignedMultiplyHigh further increase speed to
at least 3-fold faster.
h2. Conclusion
These results indicate that the Philox4x64 generator is performance limited by
the requirement for a 128-bit result from a 64-bit multiplication. Performance
could be improved using a MethodHandle to faster JDK methods to compute this
result.
> Add Philox random number generator
> ----------------------------------
>
> Key: RNG-188
> URL: https://issues.apache.org/jira/browse/RNG-188
> Project: Commons RNG
> Issue Type: New Feature
> Components: core
> Reporter: Alex Herbert
> Priority: Minor
> Fix For: 1.7
>
>
> Add two new random number generators: philox4x32 and philox4x64 from
> [https://www.thesalmons.org/john/random123/]
> These are quite standard nowadays (part of CUDA, Numpy, default for Pytorch
> on GPU, default for TensorFlow) and there is no official Java implementation.
> Counter-based PRNGs are great for parallelization, as skipping-ahead is
> nearly as fast as a single random number generation, and the counter makes it
> very easy to create subsequences.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)