[ 
https://issues.apache.org/jira/browse/RNG-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061378#comment-18061378
 ] 

Alex Herbert commented on RNG-188:
----------------------------------

The Philox 4x64 generator uses 64-bit multiplication to create a 128-bit 
result. This has potential for a performance increase using higher JVM versions.
h2. Unsigned multiply high

The default implementation in Commons RNG uses a composed addition of products 
to compute the upper 64-bits. The java.lang.Math class has methods to compute 
the upper 64-bit result with potential intrinsic calls to native functions:
||Method||JDK||Notes||
|[multiplyHigh (Javadoc JDK 
21)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#multiplyHigh(long,long)]|9|Convert
 to unsigned using:
Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a)|
|[unsignedMultiplyHigh (Javadoc JDK 
21)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#unsignedMultiplyHigh(long,long)]|18|
 |

Since Commons RNG targets Java 8 these methods cannot be used. However they can 
be used with a 
[MethodHandle|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/invoke/MethodHandle.html]
 introduced in Java 7:
{code:java}
// find the method
MethodHandle h = MethodHandles.lookup()
    .findStatic(Math.class,
                "unsignedMultiplyHigh",
                MethodType.methodType(long.class, long.class, long.class));

// invoke (snippet)
long a, b;
try {
    long r = (long) h2.invokeExact(a, b);
} catch (Throwable e) {
    throw new RuntimeException(e);
}
{code}
I added the use of method handles to the existing LXMBenchmark for unsigned 
multiply.
||Method||Notes||
|baseline2|Multiplication of 2 long values.|
|unsignedMultiplyHigh|Composed addition of products.|
|mhMultiplyHigh|Uses a method handle to call either Math.multiplyHigh and 
convert the result to unsigned; or defaults to the composed addition of 
products.|
|mhUnsignedMultiplyHigh|Uses a method handle to call either 
Math.unsignedMultiplyHigh; or defaults to the composed addition of products.|

The method handle is obtained once during class initialisation and invocation 
is stored as a static final reference to a primitive typed  LongBinaryOperator. 
This avoids the use of boxing arguments when the method is not available, i.e. 
the default call to the composed addition does not use a method handle. All 
calls to compute a result invoke the operator.

Results from an M2 Max processor using various OpenJDK versions. Times are in 
ns/operation:
|Method|8|11|21|
|baseline2|4.203|3.036|1.978|
|unsignedMultiplyHigh|4.475|3.211|2.228|
|mhMultiplyHigh|4.514|3.049|2.081|
|mhUnsignedMultiplyHigh|4.561|3.194|1.975|

On Java 8 all three methods should be calling the same implementation. The 
performance is within error.

On Java 11 the mhMultiplyHigh method can use the JDK multiplyHigh 
implementation and performance improves.

On Java 21 the mhUnsignedMultiplyHigh method increases performance to the speed 
of a native long multiplication.

Note that this benchmark has an overhead of generation of random long values to 
multiply. These results used 1024 precomputed long values with stride indexing 
into the array; different strides are used for argument a and b allowing 1024^2 
possible products. The benchmark overhead (baseline2) is most of the time for 
the benchmark so a real world performance test was made using the Philox 
generator.
h2. Philox4x64

I added a benchmark to test using the method handle from a Philox4x64 generator:
||RNG||Notes||
|PHILOX_4X64|The current v1.7-SNAPSHOT version of the Philox4x64 generator (for 
reference).|
|PHILOX_4X64_ORIGINAL|A copy of the original implementation of the Philox4x64 
generator, modified to make members protected to allow override.|
|PHILOX_4X64_MH|Overridden generation of the output sequence using the method 
handle to Math.multiplyHigh.|
|PHILOX_4X64_UMH|Overridden generation of the output sequence using the method 
handle to Math.unsignedMultiplyHigh.|

Results are provided across LTS JDKs using the OpenJDK implementation 
(1.8.0_472; 11.0.29; 17.0.17; 21.0.9; 25.0.2):
|RNG|8|11|17|21|25|
|PHILOX_4X64|17.63|11.03|9.495|9.452|9.536|
|PHILOX_4X64_ORIGINAL|17.49|11.15|9.491|9.454|9.57|
|PHILOX_4X64_MH|17.41|6.641|4.172|4.026|4.036|
|PHILOX_4X64_UMH|17.58|11.07|9.493|2.624|2.588|

The performance of the default implementation improves from 8 to 11 to 17 then 
plateaus.

The Philox implementation and its copy have identical results across JDKs (as 
expected).

On JDK 8 all versions are a similar speed.

On JDK 11 and 17 the use of the Math.multiplyHigh doubles the speed.

On JDK 21 and 25 the use of Math.unsignedMultiplyHigh further increase speed to 
at least 3-fold faster.
h2. Conclusion

These results indicate that the Philox4x64 generator is performance limited by 
the requirement for a 128-bit result from a 64-bit multiplication. Performance 
could be improved using a MethodHandle to faster JDK methods to compute this 
result.

> Add Philox random number generator
> ----------------------------------
>
>                 Key: RNG-188
>                 URL: https://issues.apache.org/jira/browse/RNG-188
>             Project: Commons RNG
>          Issue Type: New Feature
>          Components: core
>            Reporter: Alex Herbert
>            Priority: Minor
>             Fix For: 1.7
>
>
> Add two new random number generators: philox4x32 and philox4x64 from 
> [https://www.thesalmons.org/john/random123/]
> These are quite standard nowadays (part of CUDA, Numpy, default for Pytorch 
> on GPU, default for TensorFlow) and there is no official Java implementation.
> Counter-based PRNGs are great for parallelization, as skipping-ahead is 
> nearly as fast as a single random number generation, and the counter makes it 
> very easy to create subsequences.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to