[ https://issues.apache.org/jira/browse/RNG-113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909364#comment-16909364 ]

Alex D Herbert commented on RNG-113:
------------------------------------

I have benchmarked the standard XshRs implementation against:
 * xshrsLong: A specialised version using the nextLong method described in the 
issue below, avoiding the round trip from long to int and back to long. The 
nextDouble method uses {{NumberFactory.makeDouble(nextLong())}}. This does not 
match the standard IntProvider output but still avoids the round trip through 
int.
 * xshrsDouble: A specialised version using the nextLong and nextDouble methods 
described in the issue below, both avoiding the round trip from long to int and 
back to long.
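For context on why the xshrsLong nextDouble cannot match the default output: the two NumberFactory conversions select different 53-bit subsets of the same 64 bits. A minimal local sketch (the bit layouts are assumptions inferred from the masks in the issue description below, not the Commons RNG source):

```java
public class DoubleConversion {
    /** Assumed layout of NumberFactory.makeDouble(long): upper 53 bits of one long. */
    static double makeDouble(long v) {
        return (v >>> 11) * 0x1.0p-53;
    }

    /**
     * Assumed layout of NumberFactory.makeDouble(int, int): 26 bits from the
     * first int and 27 bits from the second, matching the quoted masks.
     */
    static double makeDouble(int v, int w) {
        return ((((long) (v >>> 6)) << 27) | (w >>> 5)) * 0x1.0p-53;
    }

    public static void main(String[] args) {
        final long bits = 0x123456789abcdef0L;
        // Same 64 bits, but the two conversions select different 53-bit subsets,
        // so they generally produce different doubles in [0, 1).
        System.out.println(makeDouble(bits));
        System.out.println(makeDouble((int) (bits >>> 32), (int) bits));
    }
}
```

Both results are uniform in [0, 1), so either is a valid generator output; the point is only that the two paths disagree bit-for-bit.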

Here is the performance for number generation. The JMH baseline time for 
consuming a long/double value has been subtracted from each result.
||Name||Method||Time||Relative||
|xshrs|nextDouble|4.625|1.000|
|xshrsLong|nextDouble|4.017|0.868|
|xshrsDouble|nextDouble|3.958|0.856|
|xshrs|nextLong|3.649|1.000|
|xshrsLong|nextLong|3.122|0.856|
|xshrsDouble|nextLong|3.176|0.870|

Thus the custom implementations are faster. There are two results for the 
alternative nextLong method because both specialised implementations share the 
same nextLong; only their nextDouble methods differ. The xshrsDouble variant is 
marginally faster, but the results are close and may be within measurement 
error.

To test whether the improvement translates to a real-world use case I timed:
 * 10 samples from a SmallMeanPoissonSampler with mean 40. This sampler makes 
approximately (mean + 1) calls to nextDouble per sample.
 * 100 samples from a ZigguratNormalizedGaussianSampler. This sampler mainly 
uses nextLong, with occasional calls to nextDouble (approximately 4% of the 
time) to complete a sample.
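The ~(mean + 1) call count follows from the classic multiplication-based small-mean Poisson algorithm, which draws uniforms until their running product drops below e^-mean. A self-contained sketch (an assumption that SmallMeanPoissonSampler uses this scheme; the RNG and seed here are illustrative):

```java
import java.util.SplittableRandom;

public class SmallMeanPoisson {
    /** One Poisson sample; returns {sample, number of nextDouble calls used}. */
    static long[] sample(SplittableRandom rng, double mean) {
        // Valid only while e^-mean does not underflow (mean must be small).
        final double limit = Math.exp(-mean);
        double p = 1;
        long n = 0;
        long calls = 0;
        while (true) {
            calls++;
            p *= rng.nextDouble();
            if (p < limit) {
                return new long[] {n, calls};
            }
            n++;
        }
    }

    /** Average sample value and nextDouble calls per sample over many samples. */
    static double[] run(long seed, int samples, double mean) {
        final SplittableRandom rng = new SplittableRandom(seed);
        double sum = 0;
        double calls = 0;
        for (int i = 0; i < samples; i++) {
            final long[] s = sample(rng, mean);
            sum += s[0];
            calls += s[1];
        }
        return new double[] {sum / samples, calls / samples};
    }

    public static void main(String[] args) {
        // With mean 40 the sample average should be near 40 and the
        // nextDouble usage near 41 calls per sample.
        final double[] r = run(42L, 20000, 40);
        System.out.printf("sample mean = %.2f, nextDouble calls per sample = %.2f%n",
            r[0], r[1]);
    }
}
```

This is why a faster nextDouble dominates the Poisson sampling time: nearly every generator call in that benchmark goes through nextDouble.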

||Name||Method||Median||Relative||
|xshrs|nextGaussianSample|1599.00|1.000|
|xshrsLong|nextGaussianSample|1591.46|0.995|
|xshrsDouble|nextGaussianSample|1539.65|0.963|
|xshrs|nextPoissonSample|1665.13|1.000|
|xshrsLong|nextPoissonSample|1575.59|0.946|
|xshrsDouble|nextPoissonSample|1497.86|0.900|

Here too the custom implementations are faster. The version with a specialised 
nextDouble is faster than converting a long to a double (as in xshrsLong): the 
custom nextDouble implementation avoids one right shift of a long value and 
replaces a left shift with a right shift. These differences give a modest 
speed increase. The output from the generator is identical to the original 
IntProvider implementation.
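The bit-for-bit equivalence can be checked with a self-contained sketch of the XSH RS output function. The LCG multiplier is the standard PCG 64/32 constant; the increment and seed are arbitrary assumptions, since the real AbstractPcg6432 keeps its state private:

```java
public class PcgXshRsCheck {
    private static final long MULT = 6364136223846793005L; // standard PCG multiplier
    private static final long INC = 1442695040888963407L;  // any odd increment

    static long bump(long s) {
        return s * MULT + INC;
    }

    /** Standard 32-bit XSH RS output function. */
    static int output(long x) {
        return (int) ((x ^ (x >>> 22)) >>> (22 + (int) (x >>> 61)));
    }

    /** Default IntProvider path: compose two ints into a long. */
    static long nextLongViaInts(long s) {
        final int a = output(s);
        final int b = output(bump(s));
        return (((long) a) << 32) | (b & 0xffffffffL);
    }

    /** Specialised long-only path from the issue description. */
    static long nextLongDirect(long s) {
        final long x = s;
        final long y = bump(s);
        final long upper = (x ^ (x >>> 22)) << (10 - (int) (x >>> 61));
        final long lower = (y ^ (y >>> 22)) >>> (22 + (int) (y >>> 61));
        return (upper & 0xffffffff00000000L) | (lower & 0xffffffffL);
    }

    /** Default path: makeDouble(int, int) layout (26 + 27 bits). */
    static double nextDoubleViaInts(long s) {
        final int a = output(s);
        final int b = output(bump(s));
        return ((((long) (a >>> 6)) << 27) | (b >>> 5)) * 0x1.0p-53;
    }

    /** Upper mask for top 26 bits shifted by 27 bits. */
    private static final long MASK1 = ((long) (0xffffffff >>> 6)) << 27;
    /** Lower mask for 27 bits. */
    private static final long MASK2 = 0xffffffffL >>> 5;

    /** Specialised long-only nextDouble path from the issue description. */
    static double nextDoubleDirect(long s) {
        final long x = s;
        final long y = bump(s);
        final long upper = (x ^ (x >>> 22)) >>> (1 + (int) (x >>> 61));
        final long lower = (y ^ (y >>> 22)) >>> (27 + (int) (y >>> 61));
        return ((upper & MASK1) | (lower & MASK2)) * 0x1.0p-53;
    }

    public static void main(String[] args) {
        long s = 0x9e3779b97f4a7c15L; // arbitrary seed
        for (int i = 0; i < 1_000_000; i++) {
            if (nextLongViaInts(s) != nextLongDirect(s)
                || nextDoubleViaInts(s) != nextDoubleDirect(s)) {
                throw new AssertionError("mismatch at state " + s);
            }
            s = bump(s);
        }
        System.out.println("identical over 1000000 states");
    }
}
```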

 

> Speed improvement to PCG 32 Xsh-Rs implementation
> -------------------------------------------------
>
>                 Key: RNG-113
>                 URL: https://issues.apache.org/jira/browse/RNG-113
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Priority: Minor
>
> Investigate a possible speed increase for the PCG 32 Xsh-Rs generator for the 
> methods nextLong and nextDouble.
> The default implementation for nextLong() and nextDouble() for IntProvider is:
> {code:java}
> public long nextLong() {
>     return NumberFactory.makeLong(nextInt(), nextInt());
> }
> public double nextDouble() {
>     return NumberFactory.makeDouble(nextInt(), nextInt());
> }
> {code}
> Each of these methods composes a long from two int values converted to long.
> The Pcg32 implementation of nextInt() creates the int from a primitive cast 
> conversion of a long. *Thus the current default implementation for nextLong 
> and nextDouble is performing a round trip of a long to an int to a long twice 
> over.* The same logic can be implemented by specialised versions of these 
> methods that work only with long types.
> {code:java}
> public long nextLong() {
>     // Get two values from the LCG
>     final long x = state;
>     final long y = bump(state);
>     state = bump(y);
>     // Perform mix function.
>     // For a 32-bit output the x bits should be shifted down (22 + (int) (x >>> 61)).
>     // Leave in the upper bits by shift up 32 - (22 + (int) (x >>> 61))
>     final long upper = (x ^ (x >>> 22)) << (10 - (int) (x >>> 61));
>     final long lower = (y ^ (y >>> 22)) >>> (22 + (int) (y >>> 61));
>     return (upper & 0xffffffff00000000L) | (lower & 0xffffffffL);
> }
> /** Upper mask for top 26 bits shifted by 27 bits. */
> private static final long MASK1 = ((long) (0xffffffff >>> 6)) << 27;
> /** Lower mask shifted by 5 for 27 bits. */
> private static final long MASK2 = 0xffffffffL >>> 5;
> public double nextDouble() {
>     // Get two values from the LCG
>     final long x = state;
>     final long y = bump(state);
>     state = bump(y);
>     // Perform mix function.
>     // For a 32-bit output the x bits should be shifted down (22 + (int) (x >>> 61)).
>     // To match nextDouble requires 26-bits from int 1 and 27 bits from int 2.
>     // Int 1 is stored in the upper 32-bits as per nextLong() but shifted down 11 and
>     // then masked to keep the upper 26-bits. Discard an extra 5 from int 2.
>     final long upper = (x ^ (x >>> 22)) >>> (1 + (int) (x >>> 61));
>     final long lower = (y ^ (y >>> 22)) >>> (27 + (int) (y >>> 61));
>     return ((upper & MASK1) | (lower & MASK2)) * 0x1.0p-53;
> }
> {code}
> These implementations require that the private {{state}} member and 
> {{bump()}} method of the parent AbstractPcg6432 are made accessible.
> The change should be tested to determine if there is a performance increase 
> with a custom implementation.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
