[
https://issues.apache.org/jira/browse/RNG-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801978#comment-16801978
]
Alex D Herbert commented on RNG-87:
-----------------------------------
I've done some more investigation into generating {{int}}, {{float}}, {{long}} and
{{double}} values.
My assumption was that it would take about twice as long to generate a {{long}}
as an {{int}}.
That is not what the current benchmark shows:
||numValues||randomSourceName||Method||Score||Error||Median||
|1000000|MWC_256|nextInt|4.21e+03|7.88|4.21e+03|
|1000000|MWC_256|nextLong|6.39e+03|44.4|6.38e+03|
|1000000|MWC_256B|nextInt|4.12e+03|7.98|4.12e+03|
|1000000|MWC_256B|nextLong|5.91e+03|15.7|5.91e+03|
|1000000|MWC_256C|nextInt|4.27e+03|6.62|4.27e+03|
|1000000|MWC_256C|nextLong|5.85e+03|22.6|5.85e+03|
According to this, generating twice as many numbers (a {{long}} is built from two
{{int}} values) takes only about 50% more effort.
The current benchmark does this:
{code:java}
public void nextInt(Sources sources,
                    Blackhole bh) {
    for (int i = 0; i < numValues; i++) {
        bh.consume(sources.getGenerator().nextInt());
    }
}
{code}
The internals of the Blackhole consume method do this:
{code:java}
public final void consume(int i) {
    int i1 = this.i1; // volatile read
    int i2 = this.i2;
    if ((i ^ i1) == (i ^ i2)) {
        // SHOULD NEVER HAPPEN
        nullBait.i1 = i; // implicit null pointer exception
    }
}
{code}
So there is quite a bit of overhead in {{consume}}. I tried a few variants:
{code:java}
public void nextInt1(Sources sources, Blackhole bh) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = 0; i < numValues; i++) {
        value = rng.nextInt();
    }
    bh.consume(value);
}

public void nextInt2(Sources sources, Blackhole bh) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = 0; i < numValues; i++) {
        value ^= rng.nextInt();
    }
    bh.consume(value);
}

public int nextInt3(Sources sources) {
    int value = 0;
    UniformRandomProvider rng = sources.getGenerator();
    for (int i = 0; i < numValues; i++) {
        value = rng.nextInt();
    }
    return value;
}
{code}
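For reference, the {{long}} variant of method 1 follows the same shape. Here is a sketch that runs outside the benchmark harness (the class name and the stand-in {{SplittableRandom}} generator are assumptions; the real benchmark uses {{UniformRandomProvider}} and a JMH {{Blackhole}}):
{code:java}
import java.util.SplittableRandom;

public class NextLong1Sketch {
    static final int NUM_VALUES = 1_000_000;

    // Same pattern as nextInt1: generate in a tight loop, keep only the
    // last value, and hand a single value to the sink afterwards.
    static long nextLong1(SplittableRandom rng) {
        long value = 0;
        for (int i = 0; i < NUM_VALUES; i++) {
            value = rng.nextLong();
        }
        return value; // in the benchmark this would go to bh.consume(value)
    }

    public static void main(String[] args) {
        System.out.println(nextLong1(new SplittableRandom(42)));
    }
}
{code}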
I also ran the corresponding versions for {{nextLong}}. Here are the results:
||numValues||randomSourceName||Method||Score||Error||Median||
|1000000|MWC_256|nextInt|4206.531|7.880|4205.816|
|1000000|MWC_256|nextLong|6385.291|44.398|6381.443|
|1000000|MWC_256B|nextInt|4120.465|7.983|4121.483|
|1000000|MWC_256B|nextLong|5909.085|15.654|5909.844|
|1000000|MWC_256C|nextInt|4267.758|6.621|4268.275|
|1000000|MWC_256C|nextLong|5854.299|22.565|5851.805|
|1000000|MWC_256|nextInt1|2740.444|5.266|2740.126|
|1000000|MWC_256|nextLong1|4272.564|5.007|4272.492|
|1000000|MWC_256B|nextInt1|1196.116|4.460|1195.403|
|1000000|MWC_256B|nextLong1|2415.398|4.686|2414.995|
|1000000|MWC_256C|nextInt1|1195.527|2.045|1195.529|
|1000000|MWC_256C|nextLong1|2171.363|105.871|2152.932|
|1000000|MWC_256|nextInt2|2778.835|46.560|2771.455|
|1000000|MWC_256|nextLong2|4299.204|5.074|4299.135|
|1000000|MWC_256B|nextInt2|1197.354|3.171|1197.491|
|1000000|MWC_256B|nextLong2|2947.189|6.562|2947.029|
|1000000|MWC_256C|nextInt2|1198.200|5.077|1197.895|
|1000000|MWC_256C|nextLong2|2410.498|4.040|2410.655|
|1000000|MWC_256|nextInt3|2625.409|8.581|2625.232|
|1000000|MWC_256|nextLong3|4107.880|3.337|4107.783|
|1000000|MWC_256B|nextInt3|1196.417|3.127|1196.277|
|1000000|MWC_256B|nextLong3|2417.205|2.691|2417.508|
|1000000|MWC_256C|nextInt3|1195.608|3.154|1195.679|
|1000000|MWC_256C|nextLong3|2132.364|3.225|2132.588|
So in the new benchmark methods, the B and C variants *without the consume* in the
loop scale approximately 2-fold between {{int}} and {{long}}. The original version
still scales at only about 60% for {{nextLong}}. I do not fully understand this;
I attribute it to how the two successive {{nextInt()}} calls can be inlined.
The new test methods do not really differ for {{int}} (same score), but the xor
accumulation is noticeably more expensive for {{long}} (method 2), as shown below:
||numValues||randomSourceName||Method||Score||Error||Median||
|1000000|MWC_256C|nextInt|4267.758|6.621|4268.275|
|1000000|MWC_256C|nextInt1|1195.527|2.045|1195.529|
|1000000|MWC_256C|nextInt2|1198.200|5.077|1197.895|
|1000000|MWC_256C|nextInt3|1195.608|3.154|1195.679|
|1000000|MWC_256C|nextLong|5854.299|22.565|5851.805|
|1000000|MWC_256C|nextLong1|2171.363|105.871|2152.932|
|1000000|MWC_256C|nextLong2|2410.498|4.040|2410.655|
|1000000|MWC_256C|nextLong3|2132.364|3.225|2132.588|
Note that the loop still executes when the values are not consumed using the
Blackhole:
||numValues||randomSourceName||Method||Score||Error||Median||
|1|MWC_256C|nextInt|0.006|0.000|0.006|
|100|MWC_256C|nextInt|0.129|0.001|0.129|
|10000|MWC_256C|nextInt|11.971|0.010|11.971|
|1000000|MWC_256C|nextInt|1194.277|2.786|1194.664|
This scales by the appropriate factor of 100.
So I think that calling consume within the loop is not the correct use of the
Blackhole.
Using the above knowledge I benchmarked using method 1, which stores the last
value from the RNG and then consumes it once. Here are the results:
||JDK||numValues||randomSourceName||Method||Score||Error||Median||Rel Score||Rel Median||
|1.8.0_191|1000000|MWC_256|nextDouble|4,373.98|12.06|4,373.41|1.00|1.00|
|1.8.0_191|1000000|MWC_256B|nextDouble|2,407.52|19.91|2,403.94|0.55|0.55|
|1.8.0_191|1000000|MWC_256C|nextDouble|2,153.73|8.23|2,152.94|0.49|0.49|
|1.8.0_191|1000000|MWC_256|nextFloat|2,751.41|5.15|2,751.15|1.00|1.00|
|1.8.0_191|1000000|MWC_256B|nextFloat|1,071.76|1.78|1,071.80|0.39|0.39|
|1.8.0_191|1000000|MWC_256C|nextFloat|1,266.48|579.94|1,199.22|0.46|0.44|
|1.8.0_191|1000000|MWC_256|nextInt|2,618.14|2.89|2,618.11|1.00|1.00|
|1.8.0_191|1000000|MWC_256B|nextInt|1,114.27|2.65|1,113.87|0.43|0.43|
|1.8.0_191|1000000|MWC_256C|nextInt|1,264.59|558.13|1,200.45|0.48|0.46|
|1.8.0_191|1000000|MWC_256|nextLong|4,074.60|12.67|4,073.33|1.00|1.00|
|1.8.0_191|1000000|MWC_256B|nextLong|2,399.82|5.02|2,400.10|0.59|0.59|
|1.8.0_191|1000000|MWC_256C|nextLong|2,140.19|5.41|2,139.97|0.53|0.53|
|11.0.2|1000000|MWC_256|nextDouble|4,289.80|11.50|4,290.45|1.00|1.00|
|11.0.2|1000000|MWC_256B|nextDouble|2,413.51|23.84|2,411.10|0.56|0.56|
|11.0.2|1000000|MWC_256C|nextDouble|2,124.29|4.34|2,124.28|0.50|0.50|
|11.0.2|1000000|MWC_256|nextFloat|2,780.56|6.90|2,781.70|1.00|1.00|
|11.0.2|1000000|MWC_256B|nextFloat|1,070.17|2.44|1,070.10|0.38|0.38|
|11.0.2|1000000|MWC_256C|nextFloat|1,193.96|2.35|1,194.06|0.43|0.43|
|11.0.2|1000000|MWC_256|nextInt|2,619.12|26.63|2,616.57|1.00|1.00|
|11.0.2|1000000|MWC_256B|nextInt|1,070.26|1.28|1,070.25|0.41|0.41|
|11.0.2|1000000|MWC_256C|nextInt|1,194.57|9.82|1,193.52|0.46|0.46|
|11.0.2|1000000|MWC_256|nextLong|3,937.55|28.57|3,934.80|1.00|1.00|
|11.0.2|1000000|MWC_256B|nextLong|2,399.09|4.14|2,399.08|0.61|0.61|
|11.0.2|1000000|MWC_256C|nextLong|2,131.59|6.06|2,131.25|0.54|0.54|
So on both versions of Java the new methods are much faster than the old ones.
Interestingly, version B is faster for {{int}} and {{float}} generation, while
version C is faster for {{long}} and {{double}}. This may be down to how well the
JIT compiler can inline the two successive calls to {{nextInt()}} in the following
snippet:
{code:java}
public long nextLong() {
    return NumberFactory.makeLong(nextInt(), nextInt());
}
{code}
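For context, {{NumberFactory.makeLong}} packs the two {{int}} values into one {{long}}; a minimal equivalent looks like this (a sketch, the class name is hypothetical and the actual library implementation may differ in detail):
{code:java}
public class MakeLongSketch {
    // Pack two 32-bit values into a long: the first int supplies the
    // high 32 bits, the second the low 32 bits (masked to avoid sign
    // extension when widened).
    static long makeLong(int v, int w) {
        return ((long) v << 32) | (w & 0xffffffffL);
    }

    public static void main(String[] args) {
        // high word = 1, low word = 2
        System.out.println(Long.toHexString(makeLong(1, 2))); // prints "100000002"
    }
}
{code}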
I've also tried {{@BenchmarkMode(Mode.SampleTime)}} with {{numValues}} = 100, as
follows (there is no formatting script, so the raw JMH output from JDK 1.8 is shown).
{noformat}
nextDouble 100 MWC_256 sample 142908 1113.145 ± 11.088 ns/op
nextDouble 100 MWC_256B sample 109369 969.114 ± 5.782 ns/op
nextDouble 100 MWC_256C sample 138175 795.329 ± 2.853 ns/op
nextFloat 100 MWC_256 sample 190012 875.837 ± 2.564 ns/op
nextFloat 100 MWC_256B sample 131789 693.046 ± 2.920 ns/op
nextFloat 100 MWC_256C sample 118648 705.779 ± 3.291 ns/op
nextInt 100 MWC_256 sample 102782 847.591 ± 4.087 ns/op
nextInt 100 MWC_256B sample 123291 695.747 ± 3.184 ns/op
nextInt 100 MWC_256C sample 119116 709.791 ± 3.386 ns/op
nextLong 100 MWC_256 sample 159240 982.147 ± 2.749 ns/op
nextLong 100 MWC_256B sample 126653 823.530 ± 3.045 ns/op
nextLong 100 MWC_256C sample 137978 802.975 ± 3.027 ns/op
{noformat}
I am reasonably convinced that:
* MWC_256C is the better variant here: it matches MWC_256B for {{int}}/{{float}}
and is faster for {{long}}/{{double}}.
* Benchmarking should not consume values within the loop.
I will test another platform later.
> MultiplyWithCarry256
> --------------------
>
> Key: RNG-87
> URL: https://issues.apache.org/jira/browse/RNG-87
> Project: Commons RNG
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.3
> Reporter: Alex D Herbert
> Assignee: Alex D Herbert
> Priority: Minor
> Labels: performance
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The {{MultiplyWithCarry256}} currently uses a length 256 array for internal
> state. This is cycled through continuously. The current implementation
> refills the state every 256 calls to next:
> {code:java}
> public int next() {
>     if (index == Q_SIZE) { // Whole state used up.
>         // Refill.
>         for (int i = 0; i < Q_SIZE; i++) {
>             final long t = A * (state[i] & 0xffffffffL) + carry;
>             carry = (int) (t >> 32);
>             state[i] = (int) t;
>         }
>         // Reset current index.
>         index = 0;
>     }
>     return state[index++];
> }
> {code}
> This can be changed to continuously update the state for only a single index
> on each call to next:
> {code:java}
> public int next() {
>     // Produce an index in the range 0-255
>     final int i = index++ & 0xFF;
>     final long t = A * (state[i] & 0xffffffffL) + carry;
>     carry = (int) (t >> 32);
>     return state[i] = (int) t;
> }
> {code}
> This avoids an {{if}} statement check and *marginally* improves performance.
> It has the advantage of not advancing the state further ahead than necessary.
> MWC_256B is the new version. JMH results from the GenerationPerformance
> benchmark.
> {noformat}
> openjdk version "1.8.0_191"
> OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00478|4.45e-05|0.00477|
> |1|MWC_256|nextInt|0.00521|1.69e-05|0.00521|
> |1|MWC_256B|nextInt|0.00519|0.000321|0.00512|
> |100|SPLIT_MIX_64|nextInt|0.421|0.0131|0.418|
> |100|MWC_256|nextInt|0.452|0.000968|0.452|
> |100|MWC_256B|nextInt|0.443|0.00183|0.442|
> |10000|SPLIT_MIX_64|nextInt|41.7|0.0725|41.7|
> |10000|MWC_256|nextInt|44.5|2.25|43.9|
> |10000|MWC_256B|nextInt|42.6|0.037|42.6|
> |1000000|SPLIT_MIX_64|nextInt|3.77e+03|21.2|3.77e+03|
> |1000000|MWC_256|nextInt|4.16e+03|12.8|4.16e+03|
> |1000000|MWC_256B|nextInt|3.98e+03|120|3.96e+03|
> {noformat}
> java version "11.0.2" 2019-01-15 LTS
> Java(TM) SE Runtime Environment 18.9 (build 11.0.2+9-LTS)
> Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)
> {noformat}
> ||numValues||randomSourceName||Method||Score||Error||Median||
> |1|SPLIT_MIX_64|nextInt|0.00469|4.98e-06|0.00469|
> |1|MWC_256|nextInt|0.00553|2.09e-05|0.00553|
> |1|MWC_256B|nextInt|0.00493|4.32e-05|0.00492|
> |100|SPLIT_MIX_64|nextInt|0.435|0.000624|0.435|
> |100|MWC_256|nextInt|0.467|0.00284|0.466|
> |100|MWC_256B|nextInt|0.455|0.00105|0.455|
> |10000|SPLIT_MIX_64|nextInt|43|4.94|41.9|
> |10000|MWC_256|nextInt|45.8|0.132|45.8|
> |10000|MWC_256B|nextInt|43|0.033|43|
> |1000000|SPLIT_MIX_64|nextInt|3.7e+03|7.88|3.7e+03|
> |1000000|MWC_256|nextInt|4.17e+03|45.3|4.15e+03|
> |1000000|MWC_256B|nextInt|4.12e+03|4.76|4.12e+03|
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)