[
https://issues.apache.org/jira/browse/RNG-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801978#comment-16801978
]
Alex D Herbert commented on RNG-87:
-----------------------------------
I've done some more investigation into generating {{int}}, {{float}}, {{long}} and
{{double}} values.
My assumption was that it would take about twice as long to generate a {{long}}
as an {{int}}.
That is not what the current benchmark shows:
||numValues||randomSourceName||Method||Score||Error||Median||
|1000000|MWC_256|nextInt|4.21e+03|7.88|4.21e+03|
|1000000|MWC_256|nextLong|6.39e+03|44.4|6.38e+03|
|1000000|MWC_256B|nextInt|4.12e+03|7.98|4.12e+03|
|1000000|MWC_256B|nextLong|5.91e+03|15.7|5.91e+03|
|1000000|MWC_256C|nextInt|4.27e+03|6.62|4.27e+03|
|1000000|MWC_256C|nextLong|5.85e+03|22.6|5.85e+03|
According to this, generating twice as many numbers (a {{long}} is built from two
{{int}} values) takes only about 50% more effort.
The current benchmark does this:
{code:java}
public void nextInt(Sources sources,
                    Blackhole bh) {
    for (int i = 0; i < numValues; i++) {
        bh.consume(sources.getGenerator().nextInt());
    }
}
{code}
The internals of the Blackhole consume method do this:
{code:java}
public final void consume(int i) {
    int i1 = this.i1; // volatile read
    int i2 = this.i2;
    if ((i ^ i1) == (i ^ i2)) {
        // SHOULD NEVER HAPPEN
        nullBait.i1 = i; // implicit null pointer exception
    }
}
{code}
So there is quite a bit of overhead in {{consume}}. I tried a few variants:
{code:java}
public void nextInt1(Sources sources, Blackhole bh) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = 0; i < numValues; i++) {
        value = rng.nextInt();
    }
    bh.consume(value);
}

public void nextInt2(Sources sources, Blackhole bh) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = 0; i < numValues; i++) {
        value ^= rng.nextInt();
    }
    bh.consume(value);
}

public int nextInt3(Sources sources) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = 0; i < numValues; i++) {
        value = rng.nextInt();
    }
    return value;
}
{code}
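For reference, the {{long}} variant of method 1 follows the same shape. Here is a sketch that runs outside the benchmark harness (the class name and the stand-in {{SplittableRandom}} generator are assumptions; the real benchmark uses {{UniformRandomProvider}} and a JMH {{Blackhole}}):
{code:java}
import java.util.SplittableRandom;

public class NextLong1Sketch {
    static final int NUM_VALUES = 1_000_000;

    // Same pattern as nextInt1: generate in a tight loop, keep only the
    // last value, and hand a single value to the sink afterwards.
    static long nextLong1(SplittableRandom rng) {
        long value = 0;
        for (int i = 0; i < NUM_VALUES; i++) {
            value = rng.nextLong();
        }
        return value; // in the benchmark this would go to bh.consume(value)
    }

    public static void main(String[] args) {
        System.out.println(nextLong1(new SplittableRandom(42)));
    }
}
{code}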
I also ran the corresponding versions for {{nextLong}}. Here are the results:
||numValues||randomSourceName||Method||Score||Error||Median||
|1000000|MWC_256|nextInt|4206.531|7.880|4205.816|
|1000000|MWC_256|nextLong|6385.291|44.398|6381.443|
|1000000|MWC_256B|nextInt|4120.465|7.983|4121.483|
|1000000|MWC_256B|nextLong|5909.085|15.654|5909.844|
|1000000|MWC_256C|nextInt|4267.758|6.621|4268.275|
|1000000|MWC_256C|nextLong|5854.299|22.565|5851.805|
|1000000|MWC_256|nextInt1|2740.444|5.266|2740.126|
|1000000|MWC_256|nextLong1|4272.564|5.007|4272.492|
|1000000|MWC_256B|nextInt1|1196.116|4.460|1195.403|
|1000000|MWC_256B|nextLong1|2415.398|4.686|2414.995|
|1000000|MWC_256C|nextInt1|1195.527|2.045|1195.529|
|1000000|MWC_256C|nextLong1|2171.363|105.871|2152.932|
|1000000|MWC_256|nextInt2|2778.835|46.560|2771.455|
|1000000|MWC_256|nextLong2|4299.204|5.074|4299.135|
|1000000|MWC_256B|nextInt2|1197.354|3.171|1197.491|
|1000000|MWC_256B|nextLong2|2947.189|6.562|2947.029|
|1000000|MWC_256C|nextInt2|1198.200|5.077|1197.895|
|1000000|MWC_256C|nextLong2|2410.498|4.040|2410.655|
|1000000|MWC_256|nextInt3|2625.409|8.581|2625.232|
|1000000|MWC_256|nextLong3|4107.880|3.337|4107.783|
|1000000|MWC_256B|nextInt3|1196.417|3.127|1196.277|
|1000000|MWC_256B|nextLong3|2417.205|2.691|2417.508|
|1000000|MWC_256C|nextInt3|1195.608|3.154|1195.679|
|1000000|MWC_256C|nextLong3|2132.364|3.225|2132.588|
So in the new benchmark methods, the B and C variants *without the consume* in the
loop scale approximately 2-fold between {{int}} and {{long}}. The original version
still scales at only about 60% for {{nextLong}}. I do not fully understand this;
I attribute it to how the two successive {{nextInt()}} calls can be inlined.
The new test methods do not really differ for {{int}} (same score), but the xor
accumulation is noticeably more expensive for {{long}} (method 2), as shown below:
||numValues||randomSourceName||Method||Score||Error||Median||
|1000000|MWC_256C|nextInt|4267.758|6.621|4268.275|
|1000000|MWC_256C|nextInt1|1195.527|2.045|1195.529|
|1000000|MWC_256C|nextInt2|1198.200|5.077|1197.895|
|1000000|MWC_256C|nextInt3|1195.608|3.154|1195.679|
|1000000|MWC_256C|nextLong|5854.299|22.565|5851.805|
|1000000|MWC_256C|nextLong1|2171.363|105.871|2152.932|
|1000000|MWC_256C|nextLong2|2410.498|4.040|2410.655|
|1000000|MWC_256C|nextLong3|2132.364|3.225|2132.588|
Note that the loop still executes when the values are not consumed using the
Blackhole:
||numValues||randomSourceName||Method||Score||Error||Median||
|1|MWC_256C|nextInt|0.006|0.000|0.006|
|100|MWC_256C|nextInt|0.129|0.001|0.129|
|10000|MWC_256C|nextInt|11.971|0.010|11.971|
|1000000|MWC_256C|nextInt|1194.277|2.786|1194.664|
This scales by the appropriate factor of 100.
So I think that calling consume within the loop is not the correct use of the
Blackhole.
Using the above knowledge I benchmarked using method 1, which stores the last
value from the RNG and then consumes it once. Here are the results:
||JDK||numValues||randomSourceName||Method||Score||Error||Median||Rel Score||Rel Median||
|1.8.0_191|1000000|MWC_256|nextDouble|4,373.98|12.06|4,373.41|1.00|1.00|
|1.8.0_191|1000000|MWC_256B|nextDouble|2,407.52|19.91|2,403.94|0.55|0.55|
|1.8.0_191|1000000|MWC_256C|nextDouble|2,153.73|8.23|2,152.94|0.49|0.49|
|1.8.0_191|1000000|MWC_256|nextFloat|2,751.41|5.15|2,751.15|1.00|1.00|
|1.8.0_191|1000000|MWC_256B|nextFloat|1,071.76|1.78|1,071.80|0.39|0.39|
|1.8.0_191|1000000|MWC_256C|nextFloat|1,266.48|579.94|1,199.22|0.46|0.44|
|1.8.0_191|1000000|MWC_256|nextInt|2,618.14|2.89|2,618.11|1.00|1.00|
|1.8.0_191|1000000|MWC_256B|nextInt|1,114.27|2.65|1,113.87|0.43|0.43|
|1.8.0_191|1000000|MWC_256C|nextInt|1,264.59|558.13|1,200.45|0.48|0.46|
|1.8.0_191|1000000|MWC_256|nextLong|4,074.60|12.67|4,073.33|1.00|1.00|
|1.8.0_191|1000000|MWC_256B|nextLong|2,399.82|5.02|2,400.10|0.59|0.59|
|1.8.0_191|1000000|MWC_256C|nextLong|2,140.19|5.41|2,139.97|0.53|0.53|
|11.0.2|1000000|MWC_256|nextDouble|4,289.80|11.50|4,290.45|1.00|1.00|
|11.0.2|1000000|MWC_256B|nextDouble|2,413.51|23.84|2,411.10|0.56|0.56|
|11.0.2|1000000|MWC_256C|nextDouble|2,124.29|4.34|2,124.28|0.50|0.50|
|11.0.2|1000000|MWC_256|nextFloat|2,780.56|6.90|2,781.70|1.00|1.00|
|11.0.2|1000000|MWC_256B|nextFloat|1,070.17|2.44|1,070.10|0.38|0.38|
|11.0.2|1000000|MWC_256C|nextFloat|1,193.96|2.35|1,194.06|0.43|0.43|
|11.0.2|1000000|MWC_256|nextInt|2,619.12|26.63|2,616.57|1.00|1.00|
|11.0.2|1000000|MWC_256B|nextInt|1,070.26|1.28|1,070.25|0.41|0.41|
|11.0.2|1000000|MWC_256C|nextInt|1,194.57|9.82|1,193.52|0.46|0.46|
|11.0.2|1000000|MWC_256|nextLong|3,937.55|28.57|3,934.80|1.00|1.00|
|11.0.2|1000000|MWC_256B|nextLong|2,399.09|4.14|2,399.08|0.61|0.61|
|11.0.2|1000000|MWC_256C|nextLong|2,131.59|6.06|2,131.25|0.54|0.54|
So on both versions of Java the new methods are much faster than the old ones.
Interestingly, version B is faster for {{int}} and {{float}} generation, while
version C is faster for {{long}} and {{double}}. This may be down to how well the
JIT compiler can inline the two successive calls to {{nextInt()}} in the following
snippet:
{code:java}
public long nextLong() {
    return NumberFactory.makeLong(nextInt(), nextInt());
}
{code}
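For context, {{NumberFactory.makeLong}} packs the two {{int}} values into one {{long}}; a minimal equivalent looks like this (a sketch, the class name is hypothetical and the actual library implementation may differ in detail):
{code:java}
public class MakeLongSketch {
    // Pack two 32-bit values into a long: the first int supplies the
    // high 32 bits, the second the low 32 bits (masked to avoid sign
    // extension when widened).
    static long makeLong(int v, int w) {
        return ((long) v << 32) | (w & 0xffffffffL);
    }

    public static void main(String[] args) {
        // high word = 1, low word = 2
        System.out.println(Long.toHexString(makeLong(1, 2))); // prints "100000002"
    }
}
{code}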
I've also tried {{@BenchmarkMode(Mode.SampleTime)}} with {{numValues}} = 100, as
follows (there is no formatting script, so the raw JMH output from JDK 1.8 is shown).
{noformat}
nextDouble 100 MWC_256 sample 142908 1113.145 ± 11.088 ns/op
nextDouble 100 MWC_256B sample 109369 969.114 ± 5.782 ns/op
nextDouble 100 MWC_256C sample 138175 795.329 ± 2.853 ns/op
nextFloat 100 MWC_256 sample 190012 875.837 ± 2.564 ns/op
nextFloat 100 MWC_256B sample 131789 693.046 ± 2.920 ns/op
nextFloat 100 MWC_256C sample 118648 705.779 ± 3.291 ns/op
nextInt 100 MWC_256 sample 102782 847.591 ± 4.087 ns/op
nextInt 100 MWC_256B sample 123291 695.747 ± 3.184 ns/op
nextInt 100 MWC_256C sample 119116 709.791 ± 3.386 ns/op
nextLong 100 MWC_256 sample 159240 982.147 ± 2.749 ns/op
nextLong 100 MWC_256B sample 126653 823.530 ± 3.045 ns/op
nextLong 100 MWC_256C sample 137978 802.975 ± 3.027 ns/op
{noformat}
I am reasonably convinced that:
* MWC_256C is the better variant here: it matches MWC_256B for {{int}}/{{float}}
and is faster for {{long}}/{{double}}.
* Benchmarking should not consume values within the loop.
I will test another platform later.
> MultiplyWithCarry256
> --------------------
>
> Key: RNG-87
> URL: https://issues.apache.org/jira/browse/RNG-87
> Project: Commons RNG
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.3
> Reporter: Alex D Herbert
> Assignee: Alex D Herbert
> Priority: Minor
> Labels: performance
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The {{MultiplyWithCarry256}} currently uses a length 256 array for internal
> state. This is cycled through continuously. The current implementation
> refills the state every 256 calls to next:
> {code:java}
> public int next() {
>     if (index == Q_SIZE) { // Whole state used up.
>         // Refill.
>         for (int i = 0; i < Q_SIZE; i++) {
>             final long t = A * (state[i] & 0xffffffffL) + carry;
>             carry = (int) (t >> 32);
>             state[i] = (int) t;
>         }
>         // Reset current index.
>         index = 0;
>     }
>     return state[index++];
> }
> {code}
> This can be changed to continuously update the state for only a single index
> on each call to next:
> {code:java}
> public int next() {
>     // Produce an index in the range 0-255
>     final int i = index++ & 0xFF;
>     final long t = A * (state[i] & 0xffffffffL) + carry;
>     carry = (int) (t >> 32);
>     return state[i] = (int) t;
> }
> {code}
> This avoids an {{if}} statement check and *marginally* improves performance.
> It has the advantage of not advancing the state further ahead than necessary.
> MWC_256B is the new version. JMH results from the GenerationPerformance
> benchmark.
> {noformat}
> openjdk version "1.8.0_191"
> OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00478|4.45e-05|0.00477|
> |1|MWC_256|nextInt|0.00521|1.69e-05|0.00521|
> |1|MWC_256B|nextInt|0.00519|0.000321|0.00512|
> |100|SPLIT_MIX_64|nextInt|0.421|0.0131|0.418|
> |100|MWC_256|nextInt|0.452|0.000968|0.452|
> |100|MWC_256B|nextInt|0.443|0.00183|0.442|
> |10000|SPLIT_MIX_64|nextInt|41.7|0.0725|41.7|
> |10000|MWC_256|nextInt|44.5|2.25|43.9|
> |10000|MWC_256B|nextInt|42.6|0.037|42.6|
> |1000000|SPLIT_MIX_64|nextInt|3.77e+03|21.2|3.77e+03|
> |1000000|MWC_256|nextInt|4.16e+03|12.8|4.16e+03|
> |1000000|MWC_256B|nextInt|3.98e+03|120|3.96e+03|
> {noformat}
> java version "11.0.2" 2019-01-15 LTS
> Java(TM) SE Runtime Environment 18.9 (build 11.0.2+9-LTS)
> Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00469|4.98e-06|0.00469|
> |1|MWC_256|nextInt|0.00553|2.09e-05|0.00553|
> |1|MWC_256B|nextInt|0.00493|4.32e-05|0.00492|
> |100|SPLIT_MIX_64|nextInt|0.435|0.000624|0.435|
> |100|MWC_256|nextInt|0.467|0.00284|0.466|
> |100|MWC_256B|nextInt|0.455|0.00105|0.455|
> |10000|SPLIT_MIX_64|nextInt|43|4.94|41.9|
> |10000|MWC_256|nextInt|45.8|0.132|45.8|
> |10000|MWC_256B|nextInt|43|0.033|43|
> |1000000|SPLIT_MIX_64|nextInt|3.7e+03|7.88|3.7e+03|
> |1000000|MWC_256|nextInt|4.17e+03|45.3|4.15e+03|
> |1000000|MWC_256B|nextInt|4.12e+03|4.76|4.12e+03|
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)