[ 
https://issues.apache.org/jira/browse/RNG-113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910049#comment-16910049
 ] 

Alex D Herbert commented on RNG-113:
------------------------------------

A quick test for a modified nextLong/nextDouble version removing all the method 
calls and putting the entire code inline:
||Name||Method||Score||Error||
|xshrs|nextLong|5.855|0.335|
|xshrsInline|nextLong|5.812|0.089|
|xshrsLong|nextLong|5.454|0.122|
|xshrs|nextDouble|7.072|0.529|
|xshrsInline|nextDouble|7.041|0.609|
|xshrsLong|nextDouble|6.378|0.041|
|xshrsDouble|nextDouble|6.320|0.057|

(Here I omitted the duplicate xshrsDouble::nextLong result.)

The inline version is not faster than the default implementation. The small 
differences are well within measurement error.

The specialised version is again observed to be faster.

I have rewritten the test benchmark to use the following structure in the base 
implementation:
{code:java}
/** {@inheritDoc} */
@Override
public int next() {
    final long x = state;
    state = bump(state);
    return transform(x);
}

/**
 * Transform the value to a single int.
 *
 * @param x the value
 * @return the int
 */
protected abstract int transform(long x);

/** {@inheritDoc} */
@Override
public long nextLong() {
    final long x = state;
    final long y = bump(x);
    state = bump(y);
    return transformLong(x, y);
}

/**
 * Transform the two values to a single long.
 *
 * @param x the first value
 * @param y the second value
 * @return the long
 */
protected long transformLong(long x, long y) {
    // Default implementation
    return NumberFactory.makeLong(transform(x), transform(y));
}

/** {@inheritDoc} */
@Override
public double nextDouble() {
    final long x = state;
    final long y = bump(x);
    state = bump(y);
    return transformDouble(x, y);
}

/**
 * Transform the two values to a single double.
 *
 * @param x the first value
 * @param y the second value
 * @return the double
 */
protected double transformDouble(long x, long y) {
    // Default implementation
    return NumberFactory.makeDouble(transform(x), transform(y));
}
{code}
When the transform methods are overridden for long and double output then the 
speed-up is still observed.

A class implementing a permutation for the base LCG must implement the base 
transform function and can optionally override the default implementation of 
the transform functions to transform two successive longs.

 

> Speed improvement to PCG 32 Xsh-Rs implementation
> -------------------------------------------------
>
>                 Key: RNG-113
>                 URL: https://issues.apache.org/jira/browse/RNG-113
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Priority: Minor
>
> Investigate a possible speed increase for the PCG 32 Xsh-Rs generator for the 
> methods nextLong and nextDouble.
> The default implementation for nextLong() and nextDouble() for IntProvider is:
> {code:java}
> public long nextLong() {
>     return NumberFactory.makeLong(nextInt(), nextInt());
> }
> public double nextDouble() {
>     return NumberFactory.makeDouble(nextInt(), nextInt());
> }
> {code}
> Each of these methods composes a long from two int values converted to long.
> The Pcg32 implementation of nextInt() creates the int from a primitive cast 
> conversion of a long. *Thus the current default implementation for nextLong 
> and nextDouble is performing a round trip of a long to an int to a long twice 
> over.* The same logic can be implemented by a special implementation of the 
> methods that work only using long types.
> {code:java}
> public long nextLong() {
>     // Get two values from the LCG
>     final long x = state;
>     final long y = bump(state);
>     state = bump(y);
>     // Perform mix function.
>     // For a 32-bit output the x bits should be shifted down (22 + (int) (x 
> >>> 61)).
>     // Leave in the upper bits by shift up 32 - (22 + (int) (x >>> 61))
>     final long upper = (x ^ (x >>> 22)) << (10 - (int) (x >>> 61));
>     final long lower = (y ^ (y >>> 22)) >>> (22 + (int) (y >>> 61));
>     return (upper & 0xffffffff00000000L) | (lower & 0xffffffffL);
> }
> /** Upper mask for top 26 bits shifted by 27 bits. */
> private static final long MASK1 = ((long) (0xffffffff >>> 6)) << 27;
> /** Lower mask shifted by 5 for 27 bits. */
> private static final long MASK2 = 0xffffffffL >>> 5;
> public double nextDouble() {
>     // Get two values from the LCG
>     final long x = state;
>     final long y = bump(state);
>     state = bump(y);
>     // Perform mix function.
>     // For a 32-bit output the x bits should be shifted down (22 + (int) (x 
> >>> 61)).
>     // To match nextDouble requires 26-bits from int 1 and 27 bits from int 2.
>     // Int 1 is stored in the upper 32-bits as per nextLong() but shifted 
> down 11 and
>     // then masked to keep the upper 26-bits. Discard an extra 5 from int 2.
>     final long upper = (x ^ (x >>> 22)) >>> (1 + (int) (x >>> 61));
>     final long lower = (y ^ (y >>> 22)) >>> (27 + (int) (y >>> 61));
>     return ((upper & MASK1) | (lower & MASK2)) * 0x1.0p-53;
> }
> {code}
> These implementations require that the private {{state}} member and 
> {{bump()}} method of the parent AbstractPcg6432 are made accessible.
> The change should be tested to determine if there is a performance increase 
> with a custom implementation.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to