[ 
https://issues.apache.org/jira/browse/RNG-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062145#comment-18062145
 ] 

Alex Herbert commented on RNG-191:
----------------------------------

h2. LXM generators are not slowed by unsigned multiply

Further to the benchmark results above, I added a dedicated benchmark for the 
LXM generators using either the current unsigned multiply high or the use of 
Math.unsignedMultiplyHigh/multiplyHigh. I tested this on an M2 Pro across 
various JVMs. I will not show the results as they are similar to those above. 
The use of the Math multiply methods are either the same speed or slower.

The conclusion is that the unsigned multiply is not the performance critical 
part of the LXM generators. To test this I removed the call completely and used 
a standard 64-bit multiply. This will create a broken generator with the wrong 
output. These generators were either the same speed or slower than the original 
implementation.
h2. Use within a sampler

It is possible that the nextLong benchmark is optimising the generator so that 
the unsigned multiply can be computed in parallel on the processor pipeline. I 
tested using the generator in more complex code to generate Gaussian samples:
||Sampler||RNG use||
|BoxMullerNormalizedGaussianSampler|Creates two samples with a call to 
nextDouble and nextLong. Most of the time is used to map samples to a Gaussian 
using Math calls to sin, cos, sqrt and log. There is no rejection rate.|
|MarsagliaNormalizedGaussianSampler|Creates 2 double samples from a unit box 
and maps to a unit circle. Rejection rate is 1 - pi/4 = 21.4%. When inside the 
circle a call to Math sqrt and log is used to map to a Gaussian.|
|ZigguratNormalizedGaussianSampler|Uses a look-up table approximately 98.5% of 
the time with nextLong. The other branch fixes the sample output using branches 
with either nextLong or nextDouble.|
|ZigguratSampler.NormalizedGaussian|Uses a look-up table approximately 98.8% of 
the time with nextLong. The other branch fixes the sample output using branches 
with nextLong.|

Results on an M2 Max using OpenJDK 1.8.0_472 or 21.0.9. Results in 
ns/operation. 
||RNG||JDK||Version||BoxMullerNormalizedGaussianSampler||MarsagliaNormalizedGaussianSampler||ZigguratNormalizedGaussianSampler||ZigguratSampler.NormalizedGaussian||
|L128_X1024_MIX|8|Original|167.04|113.28|16.80|7.29|
|L128_X1024_MIX|8|UM|170.36|118.70|16.55|6.79|
|L128_X1024_MIX|21|Original|15.09|13.26|5.14|4.61|
|L128_X1024_MIX|21|UM|14.59|12.12|4.98|4.13|
|L128_X128_MIX|8|Original|165.87|113.71|16.51|5.93|
|L128_X128_MIX|8|UM|167.37|112.62|15.12|6.07|
|L128_X128_MIX|21|Original|17.55|13.13|4.74|4.21|
|L128_X128_MIX|21|UM|15.73|11.01|4.29|3.97|
|L128_X256_MIX|8|Original|167.73|114.23|15.52|6.06|
|L128_X256_MIX|8|UM|170.72|113.64|14.91|6.06|
|L128_X256_MIX|21|Original|17.26|13.03|5.02|4.39|
|L128_X256_MIX|21|UM|15.75|11.12|4.22|4.66|
|PHILOX_4X64|8|Original|182.16|129.03|25.57|19.21|
|PHILOX_4X64|8|UM|183.91|131.07|25.71|19.67|
|PHILOX_4X64|21|Original|25.45|25.65|12.02|11.63|
|PHILOX_4X64|21|UM|20.13|20.62|4.88|5.40|

The BoxMuller and Marsaglia samplers are slow. The speed of the generator is 
not critical. Use of unsigned multiply can speed up the Philox results by 20%. 
A smaller speed-up is observed for the LXM generators. This is in contrast to 
the benchmark of the generator on its own. This is the first consistent result 
showing use of unsigned multiply can speed up LXM output.

The lookup table samplers are fast enough that the generator speed is 
significant. Here use of unsigned multiply doubles the speed of the Philox 
sampler. The LXM generators can get faster but the difference is relatively 
small. One result is slower for the ZigguratSampler.NormalizedGaussian using 
L128_X256_MIX but this is within noise of the test results. The time for the 
original result on JDK 21 is 4.38 +/- 0.51 showing large variation in this 
result.
h2. Conclusion

When used within a sampler the LXM family of generator show a small increase in 
speed. This is not consistent across all generators. The L128_X256_MIX 
generator shows slowdown more often than the other generators tested.

The Philox generator consistently increases performance with Math multiply high 
methods.

I would recommend using the Math multiply high methods within the Philox 
generator only. Use across the LXM family shows inconsistent results and 
further testing is required.

 

 

> Use java.lang.invoke to dynamically call Math multiply high methods
> -------------------------------------------------------------------
>
>                 Key: RNG-191
>                 URL: https://issues.apache.org/jira/browse/RNG-191
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>            Reporter: Alex Herbert
>            Priority: Minor
>
> The following generators rely on a 128-bit multiplication result from two 
> 64-bit longs:
>  * L128_X128_MIX
>  * L128_X256_MIX
>  * L128_X1024_MIX
>  * PHILOX_4X64
> Java added support to java.lang.Math to compute the upper 64-bit result with 
> potential intrinsic calls to native functions:
> ||Method||JDK||Notes||
> |[multiplyHigh (Javadoc JDK 
> 21)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#multiplyHigh(long,long)]|9|Convert
>  to unsigned using:
> Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a)|
> |[unsignedMultiplyHigh (Javadoc JDK 
> 21)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#unsignedMultiplyHigh(long,long)]|18|
>  |
> Since Commons RNG targets Java 8 these methods cannot be used. The current 
> RNGs use a software implementation to compose the upper bits of the 128-bit 
> result. However the methods can be used with a 
> [MethodHandle|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/invoke/MethodHandle.html]
>  introduced in Java 7:
> {code:java}
> // find the method
> MethodHandle h = MethodHandles.publicLookup()
>     .findStatic(Math.class,
>                 "unsignedMultiplyHigh",
>                 MethodType.methodType(long.class, long.class, long.class));
> // invoke (snippet)
> long a, b;
> try {
>     long r = (long) h2.invokeExact(a, b);
> } catch (Throwable e) {
>     throw new RuntimeException(e);
> }
> {code}
> Investigate the use of MethodHandle within the named RNGs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to