[ 
https://issues.apache.org/jira/browse/RNG-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802299#comment-16802299
 ] 

Alex D Herbert commented on RNG-87:
-----------------------------------

I have investigated the overhead involved in the test by repeat calling the RNG 
inside the loop. The return value is used in an addition to prevent it being 
optimised away. Here's an example for 3 calls:
{code:java}
public void nextInt3(Sources sources, Blackhole bh) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = numValues; i-- != 0; ) {
        value += rng.nextInt();
        value += rng.nextInt();
        value += rng.nextInt();
    }
    bh.consume(value);
}
{code}
A picture tells a 1000 words...

!MWC_256.jpg!

There is something strange happening with the start-up cost or test overhead of 
the original MWC_256 algorithm. The slope is perhaps slightly steeper than the 
others but the difference at 1 call to RNG is so big it appears to be twice as 
slow. When 7 calls are made the difference is 43% more than MWC_256C.

I would say that this still looks like a victory for MWC_256C. The gradient is 
less than for MWC_256B.

This seems to indicate that the current benchmarking method of using a loop and 
consuming the output value from the RNG inside the loop is not suitable. The 
overhead in the loop construct and the Blackhole consume method *relative* to 
the speed that the RNG can produce numbers is noticable. This can be reduced by 
running a second loop inside the first.

In the example above I manually unrolled the loop. I've also tried an inner 
code loop with a constant for the loop limit and this produces a similar result 
although the two new methods are now very close:

!MWC_256_2.jpg!

In this case the inner loop may be preventing some optimisations with MWC_256. 
It does not scale close to the other methods.

To add fuel to the fire I let JMH set the number of values for the inner loop:
{code:java}
public void nextInt(Sources sources, Blackhole bh) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = numValues; i-- > 0;) {
        for (int j = numValues2; j-- > 0;) {
            value += rng.nextInt();
        }
    }
    bh.consume(value);
}
{code}
!MWC_256_3.jpg!

So:

3 methods for using an inner loop to make the RNG do extra work to investigate 
scaling:
 * Manually unroll a loop. The original MWC_256 is a bit worse. There is some 
noticeable start-up cost for when the inner iteration is 1 for the MWC_256 
method which makes it appears worse than it is relative to the others.
 * Code an inner loop with a constant (hopefully it will be unrolled although 
this is not certain). The original MWC_256 is a lot worse.
 * Code an inner loop with a limit set using a JMH controlled variable. The 
original MWC_256 is worse.

It is not clear cut which of MWC_256B or C is better as each attempt to run a 
benchmark had different winners. It would be a 1 win, 1 draw and 1 lose for C 
vs B in the order listed above.

The original is worse in all scenarios. Whichever test method is most 
applicable to real world scenarios where more work is done with the random 
number that is generated I cannot say. So let's do a test of some less trivial 
work:
{code:java}
public void shuffle(Sources sources, Blackhole bh) {
    UniformRandomProvider rng = sources.getGenerator();
    // Work array
    int[] array = new int[numValues2];
    for (int j = numValues2; j-- > 0;) {
        array[j] = j;
    }
    for (int i = numValues; i-- > 0;) {
        for (int j = numValues2; j-- > 0;) {
            // Fischer-Yates shuffle
            int k = rng.nextInt(j + 1);
            int tmp = array[j];
            array[j] = array[k];
            array[k] = tmp;
        }
    }
    bh.consume(array[0]);
}
{code}
!Shuffle.jpg!

So the new methods (B & C) are preferable here with method C the marginal 
winner.

 

> MultiplyWithCarry256
> --------------------
>
>                 Key: RNG-87
>                 URL: https://issues.apache.org/jira/browse/RNG-87
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>              Labels: performance
>         Attachments: MWC_256.jpg, MWC_256_2.jpg, MWC_256_3.jpg, 
> MWC_256_3.png, Shuffle.jpg
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{MultiplyWithCarry256}} currently uses a length 256 array for internal 
> state. This is cycled through continuously. The current implementation 
> refills the state every 256 calls to next:
> {code:java}
> public int next() {
>     if (index == Q_SIZE) { // Whole state used up.
>         // Refill.
>         for (int i = 0; i < Q_SIZE; i++) {
>             final long t = A * (state[i] & 0xffffffffL) + carry;
>             carry = (int) (t >> 32);
>             state[i] = (int) t;
>         }
>         // Reset current index.
>         index = 0;
>     }
>     return state[index++];
> }
> {code}
> This can be changed to continuously update the state for only a single index 
> on each call to next:
> {code:java}
> public int next() {
>     // Produce an index in the range 0-255
>     final int i = index++ & 0xFF;
>     final long t = A * (state[i] & 0xffffffffL) + carry;
>     carry = (int) (t >> 32);
>     return state[i] = (int) t;
> }
> {code}
> This avoids an {{if}} statement check and *marginally* improves performance. 
> It has the advantage of not advancing the state further ahead than necessary.
> MWC_256B is the new version. JMH results from the GenerationPerformance 
> benchmark.
> {noformat}
> openjdk version "1.8.0_191"
> OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00478|4.45e-05|0.00477|
> |1|MWC_256|nextInt|0.00521|1.69e-05|0.00521|
> |1|MWC_256B|nextInt|0.00519|0.000321|0.00512|
> |100|SPLIT_MIX_64|nextInt|0.421|0.0131|0.418|
> |100|MWC_256|nextInt|0.452|0.000968|0.452|
> |100|MWC_256B|nextInt|0.443|0.00183|0.442|
> |10000|SPLIT_MIX_64|nextInt|41.7|0.0725|41.7|
> |10000|MWC_256|nextInt|44.5|2.25|43.9|
> |10000|MWC_256B|nextInt|42.6|0.037|42.6|
> |1000000|SPLIT_MIX_64|nextInt|3.77e+03|21.2|3.77e+03|
> |1000000|MWC_256|nextInt|4.16e+03|12.8|4.16e+03|
> |1000000|MWC_256B|nextInt|3.98e+03|120|3.96e+03|
> {noformat}
> java version "11.0.2" 2019-01-15 LTS
> Java(TM) SE Runtime Environment 18.9 (build 11.0.2+9-LTS)
> Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00469|4.98e-06|0.00469|
> |1|MWC_256|nextInt|0.00553|2.09e-05|0.00553|
> |1|MWC_256B|nextInt|0.00493|4.32e-05|0.00492|
> |100|SPLIT_MIX_64|nextInt|0.435|0.000624|0.435|
> |100|MWC_256|nextInt|0.467|0.00284|0.466|
> |100|MWC_256B|nextInt|0.455|0.00105|0.455|
> |10000|SPLIT_MIX_64|nextInt|43|4.94|41.9|
> |10000|MWC_256|nextInt|45.8|0.132|45.8|
> |10000|MWC_256B|nextInt|43|0.033|43|
> |1000000|SPLIT_MIX_64|nextInt|3.7e+03|7.88|3.7e+03|
> |1000000|MWC_256|nextInt|4.17e+03|45.3|4.15e+03|
> |1000000|MWC_256B|nextInt|4.12e+03|4.76|4.12e+03|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to