[ 
https://issues.apache.org/jira/browse/RNG-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061645#comment-18061645
 ] 

Alex Herbert commented on RNG-191:
----------------------------------

Benchmarks added in RNG-188 show the Philox4x64 generator performance can be 
significantly improved using the multiply methods in Math.

I modified the current core component that computes the unsigned multiply of 
two longs and tested performance using the JMH module benchmarks.

There was inconsistent timing output from the LXM family of generators. These 
generators have a linear congruential generator combined with a Xor based 
generator (XBG). Only the update of the 128-bit LCG requires the 
multiplication. The update of the XBG and computation of the mixed output from 
both generators are independent. An efficient compiler and processor would be 
able to pipeline these computations in parallel.
||Method||Description||
|original|The current RNG 1.7-SNAPSHOT code|
|um|Version modified to use java.lang.Math unsignedMultiplyHigh (java 18+); 
multiplyHigh (java 9+); or default to the software implementation |
|um1|As um but using a variant for the order of the operations in the LXM 
family of generators|
|um2|As um but using a variant for the order of the operations in the LXM 
family of generators|
|um3|As um but using a variant for the order of the operations in the LXM 
family of generators|
|um4|As um but using a variant for the order of the operations in the LXM 
family of generators|
h2. Results

Processor: M2 Max; OpenJDK 17.0.17; 21.0.9. Results are in ns/operation. The 
JMH baseline of 0.4ns has been subtracted from the timing results.

Next Double
| |17| | |21| | | | | |
|RNG|original|um|um1|original|um|um1|um2|um3|um4|
|L128_X1024_MIX|3.602|3.326|3.306|3.556|4.473|4.471|4.073|3.359|3.414|
|L128_X128_MIX|3.332|4.844|4.814|3.344|5.268|5.284|3.454|3.457|3.504|
|L128_X256_MIX|6.807|5.147|5.122|6.858|3.377|3.372|4.015|4.171|3.223|
|PHILOX_4X64|9.170|3.867|3.808|9.262|3.166|2.455|2.473|2.444|2.446|

Next Long
| |17| | |21| | | | | |
|RNG|original|um|um1|original|um|um1|um2|um3|um4|
|L128_X1024_MIX|3.344|5.289|5.404|3.453|5.007|5.052|3.440|3.235|3.337|
|L128_X128_MIX|3.315|3.345|3.511|3.330|4.104|4.145|3.427|4.812|4.918|
|L128_X256_MIX|3.279|4.922|4.999|3.286|4.556|4.596|3.822|4.470|4.854|
|PHILOX_4X64|9.250|3.633|3.672|9.186|2.162|2.189|2.210|2.169|2.226|

I only ran variants um2-4 on JDK 21 to determine if the change was impactful. 
This was not the case.

Note that the next double output should be correlated with the next long 
output. It uses a long and then performs a shift and conversion of 53-bits to a 
double. So the time should be similar. However some results are clearly not 
close, for example the L128_X256_MIX generator is much slower for double 
generation using the original code than long generation. This inconsistency is 
observed across the LXM family in various results where double generation can 
be faster or slower than long generation. The JVM optimisations may be applied 
differently to the two cases during the benchmark.

The Philox generator is very performance sensitive to the unsigned multiply 
method. It is slow in the original code and increases in speed under JDK 17 and 
then again under JDK 21. These results reflect the observations from RNG-188 
when testing variants that use the Math functions.

The LXM generators L128_X128_MIX and L128_X1024_MIX are slowed down by use of 
the Math multiply methods. Reasons for this are unclear. If the call to compute 
the upper 128-bit multiplication result is faster then other optimisation made 
by the JVM under this case lead to an inconsistent slowdown overall. The 
L128_X256_MIX double generation improves under JDK 21 but not JDK 17. The 
generation of long values is slower under JDK 17 and 21.
h2. Conclusion

Since the LXM family perform two independent update steps per output cycle, one 
for the LCG and one for the XBG, it may be that increasing the speed of only 
the LCG has no effect on overall runtime. These benchmarks show no significant 
improvement in performance and often a negative slowdown of the generators. 
Further testing is required using different hardware and JVMs, and possibly 
different benchmarks such as use of the RNG with a random sampler to generate 
output.

The Philox4x64 generator update step relies heavily on the 128-bit 
multiplication. Performance gains are significant and consistent across 
benchmark runs.

 

> Use java.lang.invoke to dynamically call Math multiply high methods
> -------------------------------------------------------------------
>
>                 Key: RNG-191
>                 URL: https://issues.apache.org/jira/browse/RNG-191
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>            Reporter: Alex Herbert
>            Priority: Minor
>
> The following generators rely on a 128-bit multiplication result from two 
> 64-bit longs:
>  * L128_X128_MIX
>  * L128_X256_MIX
>  * L128_X1024_MIX
>  * PHILOX_4X64
> Java added support to java.lang.Math to compute the upper 64-bit result with 
> potential intrinsic calls to native functions:
> ||Method||JDK||Notes||
> |[multiplyHigh (Javadoc JDK 
> 21)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#multiplyHigh(long,long)]|9|Convert
>  to unsigned using:
> Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a)|
> |[unsignedMultiplyHigh (Javadoc JDK 
> 21)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#unsignedMultiplyHigh(long,long)]|18|
>  |
> Since Commons RNG targets Java 8 these methods cannot be used. The current 
> RNGs use a software implementation to compose the upper bits of the 128-bit 
> result. However the methods can be used with a 
> [MethodHandle|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/invoke/MethodHandle.html]
>  introduced in Java 7:
> {code:java}
> // find the method
> MethodHandle h = MethodHandles.publicLookup()
>     .findStatic(Math.class,
>                 "unsignedMultiplyHigh",
>                 MethodType.methodType(long.class, long.class, long.class));
> // invoke (snippet)
> long a, b;
> try {
>     long r = (long) h2.invokeExact(a, b);
> } catch (Throwable e) {
>     throw new RuntimeException(e);
> }
> {code}
> Investigate the use of MethodHandle within the named RNGs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to