[ 
https://issues.apache.org/jira/browse/RNG-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867772#comment-16867772
 ] 

Alex D Herbert commented on RNG-104:
------------------------------------

I have done some more analysis of seed array creation. The generator was either 
synchronized using a synchronized block or using an instance of 
{{java.util.concurrent.ReentrantLock}} with either a fair or unfair policy. For 
each synchronization a block of numbers was computed.
h2. Locks vs synchronized

The use of {{synchronized}} results in variable timing results.

!well_sync_performance.jpg!

Compare this to the use of the {{ReentrantLock}}:

!well_unfair_performance.jpg!

Here the relative time decreases in an expected fashion as the block size 
increases and the number of synchronisations reduces.

Other timings made using the unfair version of the {{ReentrantLock}} are all 
consistent. Doubling the size of the seed results in doubling of timings. This 
is not true for the synchronized block and timings are not consistent. The 
inconsistencies are typically that the time is much faster than would be 
expected. Thus the JVM is performing some optimisation for the single thread 
benchmarks with synchronized blocks to allow the exclusive lock on the object 
to be regained efficiently.

Thus we see that on a single thread the {{ReentrantLock}} is slower than 
synchronized:

!well_lock_vs_sync1.png!

But for 4 threads the {{ReentrantLock}} with an unfair policy moves ahead of 
the synchronized block:

!well_lock_vs_sync4.jpg!

The relative performance is highly variable but the lock is typically faster. A 
key point is the lock is comparable to the synchronized block on 1 thread when 
the RNG is used in blocks, and better when used on multiple threads.

Note that the {{ReentrantLock}} allows setting a fair policy where it makes 
threads wait in line. This is comparable to the unfair policy on a single 
thread but slow when there is thread contention:

!well_unfair_vs_fair.jpg!

Only when the block size is large and the system has less synchronization 
events to process does the fair policy compare well to the unfair policy. Note 
that the unfair policy is performing similar thread selection as the 
synchronized block and the two are considered comparable.
h1. Well44497B vs XorShift1024SPhi

Here is the table of compute times for different seeds on a single thread with 
no synchronisation:
||Seed|Size||WELL_44497_B||XOR_SHIFT_1024_S_PHI||Relative||
|int[]|2|17.55|6.65|0.38|
|int[]|4|29.25|9.84|0.34|
|int[]|8|52.34|17.56|0.34|
|int[]|16|98.40|31.54|0.32|
|int[]|32|197.01|68.01|0.35|
|int[]|64|380.84|120.60|0.32|
|int[]|128|749.06|250.58|0.33|
|long[]|2|41.12|8.34|0.20|
|long[]|4|78.53|12.10|0.15|
|long[]|8|159.10|21.21|0.13|
|long[]|16|309.06|43.63|0.14|
|long[]|32|610.77|83.59|0.14|
|long[]|64|1,222.49|168.82|0.14|
|long[]|128|2,497.24|406.34|0.16|

So the XorShift1024 generator is much faster. This performance improvement is 
similar when done using times to create seeds on 1 or 4 threads so I will not 
show the results.
h1. Optimal Block Size

Here are the relative performance for creating a seed on a single thread using 
an unfair lock verses creating with no concurrency provision:

!xor_unfair_int1.png!

!xor_unfair_long1.png!

So using a single synchronisation on each call to produce a value results in 
most of the time being used to do synchronisation.

If we set the limit that the SeedFactory should spend 50% of the time 
synchronising and 50% of the time creating random values then the block size 
should be 4 {{long}} and 8 {{int}} to be computed in a single block (i.e. where 
the plot relative time is approximately 2).
h1. Seed Quality

This was discussed on the dev ML and this thread. I repeat here.

For a XOR_SHIFT_1024_S_PHI generator the period is 2^1024. If sampled 2^30 
times per second (>10 billion/sec), given 31449600 secs/year or approx 2^25, it 
will take 2^969 years to repeat the period.

Basically it would be impossible to exhaust the period in a single execution of 
the application. This has the advantage that the output of the XorShift1024SPhi 
generator natively passes BigCrush. The Well44497B does not unless mixed with 
another generator:
{noformat}
RNG                                 Dieharder   TestU01 (BigCrush)
WELL_44497_B                        0,0,0,0,0   2,3,2,2,2
WELL_44497_B ^ hash code            0,0,0,0,0   2,2,2,4,2
WELL_44497_B ^ ThreadLocalRandom    0,0,0,0,0   0,0,0,1,0
WELL_44497_B ^ SplitMix64           0,0,0,0,0   2,0,0,0,1
{noformat}
Note the test failures for results of the SplitMix64 are on different tests. 
Only the plain Well44497B generator or the mix with the hash code 
systematically fail tests.
h1. Conclusions

Modifications can be made to the SeedFactory to improve seeding performance:
 * Switch to a XorShift1024SPhi generator
 * Avoid {{synchronized}} and use a {{ReentrantLock}} for stable scaling 
performance (1)
 * Compute any int[] in blocks of 8
 * Compute any long[] in blocks of 4
 * Remove the mix of output with a hashcode 

(1) Note that use of a {{ReentrantLock}} with a block size of 4 long or 8 int 
would have very similar performance to synchronized for a single thread and is 
faster for 4 threads in an extreme test of concurrency.

 

> SeedFactory seed creation performance analysis
> ----------------------------------------------
>
>                 Key: RNG-104
>                 URL: https://issues.apache.org/jira/browse/RNG-104
>             Project: Commons RNG
>          Issue Type: Task
>          Components: simple
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>         Attachments: t1.jpg, t4.jpg, well_lock_vs_sync1.png, 
> well_lock_vs_sync4.jpg, well_sync_performance.jpg, 
> well_unfair_performance.jpg, well_unfair_vs_fair.jpg, xor_unfair_int1.png, 
> xor_unfair_long1.png
>
>
> The SeedFactory is used to create seeds for the random generators. To ensure 
> thread safety this uses synchronized blocks around a single generator. The 
> current method only generates a single int or long per synchronisation. 
> Analyze the performance of this approach. The analysis will investigate 
> generating multiple values inside each synchronisation around the generator.
> This analysis will also investigate methods to supplement the SeedFactory 
> with fast methods to create seeds. This will use a fast seeding method to 
> generate a single long value. This can be a seed for a SplitMix generator 
> used to create a seed of any length.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to