[ 
https://issues.apache.org/jira/browse/RNG-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899681#comment-16899681
 ] 

Alex D Herbert commented on RNG-90:
-----------------------------------

I have run some benchmarking of the nextInt(int) method from the JDK verses the 
version of Lemire. The JDK method has a special case to test for a power of 2 
which is dealt with by skipping the rejection loop. I tested
 * JDK - JDK method without the power of 2 edge case
 * JDK2 - JDK method with the power of 2 edge case
 * Lemire - Lemire method without the power of 2 edge case
 * Lemire2 - Lemire method with the power of 2 edge case

For reference here are the methods. The power of 2 edge case drops the initial 
test for a power of 2.

*JDK2*
{code:java}
public int nextInt(int n) {
    checkStrictlyPositive(n);

    final int nm1 = n - 1;
    if ((n & nm1) == 0) {
        // Power of 2
        return next() & nm1;
    }

    int bits;
    int val;
    do {
        bits = next() >>> 1;
        val = bits % n;
    } while (bits - val + nm1 < 0);

    return val;
}
{code}
*Lemire2*
{code:java}
public int nextInt(int n) {
    checkStrictlyPositive(n);

    final int nm1 = n - 1;
    if ((n & nm1) == 0) {
        // Power of 2
        return next() & nm1;
    }

    long m = (next() & 0xffffffffL) * n;
    long l = m & 0xffffffffL;
    if (l < n) {
        // 2^32 % n
        final long t = POW_32 % n;
        while (l < t) {
            m = (next() & 0xffffffffL) * n;
            l = m & 0xffffffffL;
        }
    }
    return (int) (m >>> 32);
}
{code}
*Note*
 This method was also tested using 31-bit math for the Lemire algorithm to use 
an integer modulus operation:
{code:java}
long m = (nextInt() & 0x7fffffffL) * n;
long l = m & 0x7fffffffL;
if (l < n) {
    // 2^31 % n
    final long t = (Integer.MIN_VALUE - n) % n;
    while (l < t) {
        m = (nextInt() & 0x7fffffffL) * n;
        l = m & 0x7fffffffL;
    }
}
return (int) (m >>> 31);
{code}
However switching from 32-bit to 31-bit math changes the rejection probability 
from ((2^32 % n) / 2^32) to ((2^31 % n) / 2^31). This roughly doubles the 
rejection probability and this has a greater negative effect on performance 
than the switch to an integer modulus. So this method is slower.
h3. *Benchmarks*

The two methods were tested using power of 1 and non power of 2 sizes. This 
includes the worst case scenario for the rejection loop which is 2^30 + 1.

*nextInt(int) benchmark*

Run a single call to nextInt(int). The timings have a baseline for JMH 
measurement subtracted. This is simple code to return an integer and measures 
the timing overhead of JHM.
||Size||JDK||JDK2||Lemire||Lemire2||
|16|4.956|2.153|3.228|2.125|
|17|5.107|5.876|2.033|2.919|
|256|4.999|2.168|2.054|2.140|
|257|5.320|5.954|2.155|2.601|
|4096|5.671|1.873|2.194|2.178|
|4097|5.764|5.407|2.166|2.917|
|1073741824|5.174|1.884|8.905|1.825|
|1073741825|27.224|25.715|9.173|9.321|
 * The power of 2 edge case is much faster than without as the JMH branch 
prediction can pick this code path for the case of a fixed size for the range.
 * The Lemire method is faster than the JDK method for small n.
 * For large power of 2 N is it slower than the JDK method when both are 
without the power of 2 edge case. This can be attributed to the JDK method 
performing the modulus operation with int values but the Lemire method uses a 
long multiply and then a long modulus operation before it discovers that the 
threshold value is zero and the result cannot be rejected.
 * For the worst case scenario for rejection the Lemire method is 3 times faster

*nextInt(int) benchmark in a loop*
 Run 65536 calls to nextInt(int) summing the results. JMH is told to normalise 
times by 65536. This effectively removes the overhead for JMH consuming a value 
but adds a single int addition per call to nextInt(int) to the timings.
||Size||JDK||JDK2||Lemire||Lemire2||
|16|3.819|1.200|2.622|1.132|
|17|4.210|4.047|2.828|2.613|
|256|3.807|1.177|2.828|1.143|
|257|4.277|4.019|2.584|2.641|
|4096|3.800|1.177|2.607|1.146|
|4097|4.024|4.046|2.560|2.857|
|1073741824|3.771|1.201|7.203|1.098|
|1073741825|22.439|22.734|9.377|8.340|

Results are similar to the previous benchmark.
 * The power of 2 edge case has an advantage when the range is fixed.
 * The Lemire method does not handle very large power of 2 edge case very well.
 * The Lemire method is much faster for non-power of 2 range.

*shuffle benchmark*
 This tests calling nextInt(int) with different size arguments by performing a 
Fischer-Yates shuffle on array data. For data of size N all values in the range 
[2, N] are used once as the argument to nextInt(int) per shuffle.
||Size||JDK||JDK2||Lemire||Lemire2||
|4|18.103|14.026|12.102|12.692|
|16|78.223|69.783|41.317|51.883|
|256|1192.761|1394.104|653.311|837.604|
|4096|20974.588|20675.709|9661.873|10615.897|
|16384|72018.467|88005.242|41543.699|42490.870|
 * The power of 2 edge case is worse. The number of times the range can be 
positive and a power of 2 is only 30. So the overhead of checking each time for 
a power of 2 is noticed in the shuffled where branch prediction fails to know 
when the power of 2 edge case will occur.
 * The Lemire method shuffles even modest sized arrays (16) almost twice as 
fast. This performance benefit is observed on large arrays.

*Pseudo-shuffle benchmark*
This is as per the shuffle benchmark except data is not actually swapped in an 
array. The indices are generated and summed. This allows a larger range to be 
tested without memory effects (or restraints).
||Size||JDK||JDK2||Lemire||Lemire2||
|4|16.108|12.993|12.416|11.792|
|16|66.290|83.589|40.324|41.140|
|256|1161.457|1233.644|680.788|663.370|
|4096|20405.805|18548.533|9823.229|11486.817|
|16384|69509.409|92301.776|39124.827|41823.949|

The results for the same sizes are in a similar scale showing the pseudo 
shuffle is a good approximation for a real shuffle. The results are the same:
 * Using a power of 2 edge case slows the shuffle down
 * The Lemire method is almost twice as fast

*Conclusion*

The Lemire method has a large performance benefit for computing a value in a 
range. There is a noticeable negative effect on performance when the range is a 
very large power of 2. This would be expected to be a marginal use case for the 
method.

If the range is fixed the recommendation would be to use a 
DiscreteUniformSampler that will choose the appropriate method to sample the 
range with an optimal algorithm.

For a generic use case the number of positive integers with a power of 2 is 30 
in 2^31 -1. This is approximately 1e-8. In addition for those powers of 2 that 
are small the rejections probability is so low that rejection never practically 
occurs and the method is not slower. This is demonstrated by the better speed 
on the shuffle benchmark with a size of 4 and 16.

Also note that the current algorithm implements the JDK method from 
java.util.Random. This tests for a power of 2 only to avoid using the lower 
order bits from the Linear Congruential Generator and switches to a 
multiplication algorithm. If the code is modified to always use the 
multiplication algorithm then this edge case for low complexity lower order 
bits will still be satisfied. I would recommend updating the algorithm to use 
the Lemire method without the power of 2 edge case. This will provide the 
fastest performance for shuffling, i.e generating numbers from various ranges. 
For dedicated sampling from the same range then the DiscreteUniformSampler 
should be employed.

> Improve nextInt(int) and nextLong(long) for powers of 2
> -------------------------------------------------------
>
>                 Key: RNG-90
>                 URL: https://issues.apache.org/jira/browse/RNG-90
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The code for nextInt(int) checks the range number n is a power of two and if 
> so it computes a fast solution:
> {code:java}
>     return (int) ((n * (long) (nextInt() >>> 1)) >> 31); 
> {code}
> This scales a 31 bit positive number by a power of 2 (i.e. n) then discards 
> the least significant bits. An alternative result can be achieved using a 
> mask to discard the most significant bits:
> {code:java}
>     return nextInt() & (n-1)
> {code}
> This works if n is a power of 2 as (n-1) will be all the bits set below it. 
> Note: This method is employed by ThreadLocalRandom.
> It also makes the method applicable to nextLong(long) since you no longer 
> require the long multiplication arithmetic.
> The mask version is applicable to any generator with a long period in the 
> lower order bits. The current version for any generator with a short period 
> in the lower order bits. The non-masking method is employed by 
> {{java.util.Random}} which is a weak generator.
> The methods are currently in {{BaseProvider}}. I suggest dividing the methods 
> to use protected methods to compute the result:
> {code:java}
> @Override
> public int nextInt(int n) {
>     checkStrictlyPositive(n);
>     final int nm1 = n - 1;
>     if ((n & nm1) == 0) {
>         // Range is a power of 2
>         return nextIntPowerOfTwo(n, nm1);
>     }
>     return nextIntNonPowerOfTwo(n, nm1);
> }
> /**
>  * Generates an {@code int} value between 0 (inclusive) and the
>  * specified value (exclusive).
>  *
>  * @param n Bound on the random number to be returned. This is a power of 2.
>  * @param nm1 The bound value minus 1.
>  * @return a random {@code int} value between 0 (inclusive) and {@code n}
>  * (exclusive).
>  */
> protected int nextIntPowerOfTwo(int n, int nm1) {
>     return nextInt() & nm1;
> }
> /**
>  * Generates an {@code int} value between 0 (inclusive) and the specified 
> value
>  * (exclusive).
>  *
>  * @param n Bound on the random number to be returned. This is not a power of 
> 2.
>  * @param nm1 The bound value minus 1.
>  * @return a random {@code int} value between 0 (inclusive) and {@code n} 
> (exclusive).
>  */
> protected int nextIntNonPowerOfTwo(int n, int nm1) {
>     int bits;
>     int val;
>     do {
>         bits = nextInt() >>> 1;
>         val = bits % n;
>     } while (bits - val + nm1 < 0);
>     return val;
> }
> @Override
> public long nextLong(long n) {
>     checkStrictlyPositive(n);
>     final long nm1 = n - 1;
>     if ((n & nm1) == 0) {
>         // Range is a power of 2
>         return nextLongPowerOfTwo(n, nm1);
>     }
>     return nextLongNonPowerOfTwo(n, nm1);
> }
> /**
>  * Generates an {@code long} value between 0 (inclusive) and the
>  * specified value (exclusive).
>  *
>  * @param n Bound on the random number to be returned. This is a power of 2.
>  * @param nm1 The bound value minus 1.
>  * @return a random {@code long} value between 0 (inclusive) and {@code n}
>  * (exclusive).
>  */
> protected long nextLongPowerOfTwo(long n, long nm1) {
>     return nextLong() & nm1;
> }
> /**
>  * Generates an {@code long} value between 0 (inclusive) and the specified 
> value
>  * (exclusive).
>  *
>  * @param n Bound on the random number to be returned. This is not a power of 
> 2.
>  * @param nm1 The bound value minus 1.
>  * @return a random {@code long} value between 0 (inclusive) and {@code n} 
> (exclusive).
>  */
> protected long nextLongNonPowerOfTwo(long n, long nm1) {
>     long bits;
>     long val;
>     do {
>         bits = nextLong() >>> 1;
>         val = bits % n;
>     } while (bits - val + nm1 < 0);
>     return val;
> }
> {code}
> This will update all providers to use the new method. Then the JDK 
> implementation can be changed to override the default:
> {code:java}
> @Override
> protected int nextIntPowerOfTwo(int n, int nm1) {
>     return (int) ((n * (long) (nextInt() >>> 1)) >> 31);
> }
> @Override
> protected long nextLongPowerOfTwo(long n, long nm1) {
>     return nextLongNonPowerOfTwo(n, nm1);
> }
> {code}
> I do not know how the use of protected methods will affect performance. An 
> alternative is to inline the entire computation for the new masking method:
> {code:java}
> public int nextInt(int n) {
>     checkStrictlyPositive(n);
>     final int nm1 = n - 1;
>     if ((n & nm1) == 0) {
>         // Range is a power of 2
>         return nextInt() & nm1;
>     }
>     int bits;
>     int val;
>     do {
>         bits = nextInt() >>> 1;
>         val = bits % n;
>     } while (bits - val + nm1 < 0);
>     return val;
> }
> {code}
> Then rewrite the entire method in the JDK generator. This will be less 
> flexible if other generators are added than have short periods in the lower 
> order bits.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to