[jira] [Comment Edited] (RNG-50) PoissonSampler single use speed improvements

Alex D Herbert (JIRA) Tue, 31 Jul 2018 04:41:14 -0700


    [ 
https://issues.apache.org/jira/browse/RNG-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562006#comment-16562006
 ]


Alex D Herbert edited comment on RNG-50 at 7/31/18 11:40 AM:
-------------------------------------------------------------

I've made a simple Maven project that tests the use of the {{log(n!}} function:

[Poisson Sampler Test 
Code|https://github.com/aherbert/poisson-sampler-test-code]

Since the function outside the algorithm loop can be pre-computed I've ignored 
this from the statistics. The table below shows the number of calls made to 
{{log(n!)}} and the summary statistics on the distribution of {{n}}. For 
reference the expected standard deviation of all samples from the Poisson 
distribution is shown.

||Mean||Samples||log(n!) calls||Calls/sample||Min n||Max n||Av n||SD n||Poisson 
SD||
|40|2000000|15336|0.007668|11|77|42.676382368283775|14.47432293750233|6.324555320336759|
|45|2000000|13343|0.0066715|13|85|47.662144944914935|15.252118330974465|6.708203932499369|
|50|2000000|11930|0.005965|17|92|52.95188600167645|15.989613273366508|7.0710678118654755|
|55|2000000|10748|0.005374|18|100|57.59666914774842|16.894476407792787|7.416198487095663|
|60|2000000|9723|0.0048615|24|108|62.777332099146356|17.525350974045864|7.745966692414834|
|65|2000000|9005|0.0045025|31|110|67.82809550249861|18.2532893886245|8.06225774829855|
|70|2000000|8166|0.004083|34|118|72.43350477590008|19.056570931403215|8.366600265340756|
|75|2000000|7694|0.003847|38|129|77.51286716922277|19.75367031023264|8.660254037844387|
|80|2000000|6989|0.0034945|38|130|82.77063957647732|20.279865491852842|8.94427190999916|
|85|2000000|6651|0.0033255|47|134|87.96977898060442|20.8263169458762|9.219544457292887|
|90|2000000|6314|0.003157|47|144|92.9708584098828|21.38232095182317|9.486832980505138|
|95|2000000|5948|0.002974|49|145|97.51008742434432|22.124775693124754|9.746794344808963|
|100|2000000|5574|0.002787|56|151|102.7929673484033|22.699920614277534|10.0|
|200|2000000|2654|0.001327|139|267|203.6525998492841|31.75185127203293|14.142135623730951|
|400|2000000|1316|6.58E-4|315|504|403.35562310030394|44.98486072660934|20.0|
|800|2000000|657|3.285E-4|691|947|803.1278538812785|63.848062553855826|28.284271247461902|
|1600|2000000|336|1.68E-4|1442|1776|1604.6607142857142|86.91136884271837|40.0|
|3200|2000000|155|7.75E-5|2977|3429|3210.2903225806454|124.33892557133235|56.568542494923804|
|6400|2000000|74|3.7E-5|6088|6699|6378.472972972973|174.87476716507024|80.0|
|12800|2000000|58|2.9E-5|12335|13135|12803.775862068966|256.02912684516065|113.13708498984761|
|25600|2000000|19|9.5E-6|25102|26111|25542.052631578947|358.73101555410426|160.0|
|51200|2000000|10|5.0E-6|50301|51893|51056.6|605.0541206281124|226.27416997969522|
|102400|2000000|3|1.5E-6|101510|103388|102729.66666666667|1057.375209343386|320.0|

Interestingly the section of code that uses {{log(n!)}} is called about 0.7 
down to 0.3 % of the time. So the value of a cache of size 80 is debatable.

Reading the algorithm it is clear that the {{log(n!)}} is called with a value 
that is approximately the same as the sample that will be returned (it is just 
adjusted by a Poisson sample using a mean in the range of 0-1). Thus 
{{log(n!)}} is called with a value from the Poisson distribution. With high 
mean (>40) this is approximately a Gaussian and so the value of n will will be 
approximately in the range {{mean +/- 3*sqrt(mean)}}, with {{sqrt(mean)}} the 
expected standard deviation of the Poisson distribution. In the results the 
mean is slightly higher than the Poisson mean but the standard deviation is 
more than double. This is due the sub-sample of the Poisson that is used when 
calling {{log(n!)}}, i.e. this executes only part of the time for all samples.

If a cache is to be used then it could be made smarter and only the range of n 
expected given the mean should be cached. This will have some value when the 
user wants to compute thousands of samples, but the exact crossover point when 
a cache is beneficial would require use-case testing.

In light of these results perhaps the cache should be removed or made optional.

Given that a lot of the initial computation at the start of the main algorithm 
loop is the same I have created a new {{NoCachePoissonSampler}} that 
precomputes this. It actually delegates the work of sampling to an internal 
class that is specialised for small mean or large mean. This is less readable 
than the original but runs about 2x faster for big mean and is comparable for 
small mean to the existing version. It has little penalty when run inside a 
loop for a single use.
{noformat}
Mean 10  Single construction (33017430) vs Loop construction                    
      (43150013)   (1.306886.2x faster)
Mean 10  Single construction (33017430) vs Loop construction with no cache      
      (38633093)   (1.170082.2x faster)
Mean 10  Single construction (33017430) vs Single construction with no cache    
      (29106732)   (0.881557.2x faster)
Mean 15  Single construction (44054388) vs Loop construction                    
      (49272725)   (1.118452.2x faster)
Mean 15  Single construction (44054388) vs Loop construction with no cache      
      (47975491)   (1.089006.2x faster)
Mean 15  Single construction (44054388) vs Single construction with no cache    
      (40819826)   (0.926578.2x faster)
Mean 20  Single construction (49668097) vs Loop construction                    
      (56182968)   (1.131168.2x faster)
Mean 20  Single construction (49668097) vs Loop construction with no cache      
      (52115198)   (1.049269.2x faster)
Mean 20  Single construction (49668097) vs Single construction with no cache    
      (46318785)   (0.932566.2x faster)
Mean 25  Single construction (59471099) vs Loop construction                    
      (64330381)   (1.081708.2x faster)
Mean 25  Single construction (59471099) vs Loop construction with no cache      
      (60546587)   (1.018084.2x faster)
Mean 25  Single construction (59471099) vs Single construction with no cache    
      (54591773)   (0.917955.2x faster)
Mean 30  Single construction (68146512) vs Loop construction                    
      (74374546)   (1.091392.2x faster)
Mean 30  Single construction (68146512) vs Loop construction with no cache      
      (70451976)   (1.033831.2x faster)
Mean 30  Single construction (68146512) vs Single construction with no cache    
      (63940806)   (0.938284.2x faster)
Mean 35  Single construction (79878953) vs Loop construction                    
      (84966196)   (1.063687.2x faster)
Mean 35  Single construction (79878953) vs Loop construction with no cache      
      (87798333)   (1.099142.2x faster)
Mean 35  Single construction (79878953) vs Single construction with no cache    
      (89916831)   (1.125664.2x faster)
Mean 40  Single construction (44851418) vs Loop construction                    
      (150102896)   (3.346670.2x faster)
Mean 40  Single construction (44851418) vs Loop construction with no cache      
      (64463988)   (1.437279.2x faster)
Mean 40  Single construction (44851418) vs Single construction with no cache    
      (20547465)   (0.458123.2x faster)
Mean 45  Single construction (44565349) vs Loop construction                    
      (142062133)   (3.187726.2x faster)
Mean 45  Single construction (44565349) vs Loop construction with no cache      
      (56858335)   (1.275842.2x faster)
Mean 45  Single construction (44565349) vs Single construction with no cache    
      (20147275)   (0.452084.2x faster)
Mean 50  Single construction (43771462) vs Loop construction                    
      (146827129)   (3.354403.2x faster)
Mean 50  Single construction (43771462) vs Loop construction with no cache      
      (53001835)   (1.210877.2x faster)
Mean 50  Single construction (43771462) vs Single construction with no cache    
      (19761750)   (0.451476.2x faster)
Mean 55  Single construction (43445931) vs Loop construction                    
      (163112456)   (3.754378.2x faster)
Mean 55  Single construction (43445931) vs Loop construction with no cache      
      (52989752)   (1.219671.2x faster)
Mean 55  Single construction (43445931) vs Single construction with no cache    
      (19555998)   (0.450123.2x faster)
Mean 60  Single construction (43694078) vs Loop construction                    
      (166955616)   (3.821012.2x faster)
Mean 60  Single construction (43694078) vs Loop construction with no cache      
      (53799074)   (1.231267.2x faster)
Mean 60  Single construction (43694078) vs Single construction with no cache    
      (19645395)   (0.449612.2x faster)
Mean 65  Single construction (43766713) vs Loop construction                    
      (179079936)   (4.091693.2x faster)
Mean 65  Single construction (43766713) vs Loop construction with no cache      
      (55747299)   (1.273737.2x faster)
Mean 65  Single construction (43766713) vs Single construction with no cache    
      (20493758)   (0.468250.2x faster)
Mean 70  Single construction (43975538) vs Loop construction                    
      (189005650)   (4.297972.2x faster)
Mean 70  Single construction (43975538) vs Loop construction with no cache      
      (53124058)   (1.208037.2x faster)
Mean 70  Single construction (43975538) vs Single construction with no cache    
      (19395643)   (0.441055.2x faster)
Mean 75  Single construction (45913218) vs Loop construction                    
      (205133974)   (4.467863.2x faster)
Mean 75  Single construction (45913218) vs Loop construction with no cache      
      (55565851)   (1.210236.2x faster)
Mean 75  Single construction (45913218) vs Single construction with no cache    
      (19380347)   (0.422108.2x faster)
Mean 80  Single construction (43494472) vs Loop construction                    
      (204741747)   (4.707305.2x faster)
Mean 80  Single construction (43494472) vs Loop construction with no cache      
      (53357258)   (1.226760.2x faster)
Mean 80  Single construction (43494472) vs Single construction with no cache    
      (19207700)   (0.441612.2x faster)
{noformat}


was (Author: alexherbert):
I've made a simple Maven project that tests the use of the {{log(n!}} function:

[Poisson Sampler Test 
Code|https://github.com/aherbert/poisson-sampler-test-code]

Since the function outside the algorithm loop can be pre-computed I've ignored 
this from the statistics.
{noformat}
Mean     40  Samples=2000000  log(n!)= 15336   ( 0.00767 / sample)  min=    11  
max=    77  mean=    42.676  SD= 14.474   Poisson SD=  6.325
Mean     45  Samples=2000000  log(n!)= 13343   ( 0.00667 / sample)  min=    13  
max=    85  mean=    47.662  SD= 15.252   Poisson SD=  6.708
Mean     50  Samples=2000000  log(n!)= 11930   ( 0.00597 / sample)  min=    17  
max=    92  mean=    52.952  SD= 15.990   Poisson SD=  7.071
Mean     55  Samples=2000000  log(n!)= 10748   ( 0.00537 / sample)  min=    18  
max=   100  mean=    57.597  SD= 16.894   Poisson SD=  7.416
Mean     60  Samples=2000000  log(n!)=  9723   ( 0.00486 / sample)  min=    24  
max=   108  mean=    62.777  SD= 17.525   Poisson SD=  7.746
Mean     65  Samples=2000000  log(n!)=  9005   ( 0.00450 / sample)  min=    31  
max=   110  mean=    67.828  SD= 18.253   Poisson SD=  8.062
Mean     70  Samples=2000000  log(n!)=  8166   ( 0.00408 / sample)  min=    34  
max=   118  mean=    72.434  SD= 19.057   Poisson SD=  8.367
Mean     75  Samples=2000000  log(n!)=  7694   ( 0.00385 / sample)  min=    38  
max=   129  mean=    77.513  SD= 19.754   Poisson SD=  8.660
Mean     80  Samples=2000000  log(n!)=  6989   ( 0.00349 / sample)  min=    38  
max=   130  mean=    82.771  SD= 20.280   Poisson SD=  8.944
Mean     85  Samples=2000000  log(n!)=  6651   ( 0.00333 / sample)  min=    47  
max=   134  mean=    87.970  SD= 20.826   Poisson SD=  9.220
Mean     90  Samples=2000000  log(n!)=  6314   ( 0.00316 / sample)  min=    47  
max=   144  mean=    92.971  SD= 21.382   Poisson SD=  9.487
Mean     95  Samples=2000000  log(n!)=  5948   ( 0.00297 / sample)  min=    49  
max=   145  mean=    97.510  SD= 22.125   Poisson SD=  9.747
Mean    100  Samples=2000000  log(n!)=  5574   ( 0.00279 / sample)  min=    56  
max=   151  mean=   102.793  SD= 22.700   Poisson SD= 10.000
Mean    200  Samples=2000000  log(n!)=  2654   ( 0.00133 / sample)  min=   139  
max=   267  mean=   203.653  SD= 31.752   Poisson SD= 14.142
Mean    400  Samples=2000000  log(n!)=  1316   (0.000658 / sample)  min=   315  
max=   504  mean=   403.356  SD= 44.985   Poisson SD= 20.000
Mean    800  Samples=2000000  log(n!)=   657   (0.000329 / sample)  min=   691  
max=   947  mean=   803.128  SD= 63.848   Poisson SD= 28.284
Mean   1600  Samples=2000000  log(n!)=   336   (0.000168 / sample)  min=  1442  
max=  1776  mean=  1604.661  SD= 86.911   Poisson SD= 40.000
Mean   3200  Samples=2000000  log(n!)=   155   (7.75e-05 / sample)  min=  2977  
max=  3429  mean=  3210.290  SD=124.339   Poisson SD= 56.569
Mean   6400  Samples=2000000  log(n!)=    74   (3.70e-05 / sample)  min=  6088  
max=  6699  mean=  6378.473  SD=174.875   Poisson SD= 80.000
Mean  12800  Samples=2000000  log(n!)=    58   (2.90e-05 / sample)  min= 12335  
max= 13135  mean= 12803.776  SD=256.029   Poisson SD=113.137
Mean  25600  Samples=2000000  log(n!)=    19   (9.50e-06 / sample)  min= 25102  
max= 26111  mean= 25542.053  SD=358.731   Poisson SD=160.000
Mean  51200  Samples=2000000  log(n!)=    10   (5.00e-06 / sample)  min= 50301  
max= 51893  mean= 51056.600  SD=605.054   Poisson SD=226.274
Mean 102400  Samples=2000000  log(n!)=     3   (1.50e-06 / sample)  min=101510  
max=103388  mean=102729.667  SD=1057.375   Poisson SD=320.000
{noformat}
Interestingly the section of code that uses {{log(n!)}} is called about 0.7 
down to 0.3 % of the time. So the value of a cache of size 80 is debatable.

Reading the algorithm it is clear that the {{log(n!)}} is called with a value 
that is approximately the same as the sample that will be returned (it is just 
adjusted by a Poisson sample using a mean in the range of 0-1). Thus 
{{log(n!)}} is called with a value from the Poisson distribution. With high 
mean (>40) this is approximately a Gaussian and so the value of n will will be 
approximately in the range {{mean +/- 3*sqrt(mean)}}, with {{sqrt(mean)}} the 
expected standard deviation of the Poisson distribution. In the results the 
mean is slightly higher than the Poisson mean but the standard deviation is 
more than double. This is due the sub-sample of the Poisson that is used when 
calling {{log(n!)}}, i.e. this executes only part of the time for all samples.

If a cache is to be used then it could be made smarter and only the range of n 
expected given the mean should be cached. This will have some value when the 
user wants to compute thousands of samples, but the exact crossover point when 
a cache is beneficial would require use-case testing.

In light of these results perhaps the cache should be removed or made optional.

Given that a lot of the initial computation at the start of the main algorithm 
loop is the same I have created a new {{NoCachePoissonSampler}} that 
precomputes this. It actually delegates the work of sampling to an internal 
class that is specialised for small mean or large mean. This is less readable 
than the original but runs about 2x faster for big mean and is comparable for 
small mean to the existing version. It has little penalty when run inside a 
loop for a single use.
{noformat}
Mean 10  Single construction (33017430) vs Loop construction                    
      (43150013)   (1.306886.2x faster)
Mean 10  Single construction (33017430) vs Loop construction with no cache      
      (38633093)   (1.170082.2x faster)
Mean 10  Single construction (33017430) vs Single construction with no cache    
      (29106732)   (0.881557.2x faster)
Mean 15  Single construction (44054388) vs Loop construction                    
      (49272725)   (1.118452.2x faster)
Mean 15  Single construction (44054388) vs Loop construction with no cache      
      (47975491)   (1.089006.2x faster)
Mean 15  Single construction (44054388) vs Single construction with no cache    
      (40819826)   (0.926578.2x faster)
Mean 20  Single construction (49668097) vs Loop construction                    
      (56182968)   (1.131168.2x faster)
Mean 20  Single construction (49668097) vs Loop construction with no cache      
      (52115198)   (1.049269.2x faster)
Mean 20  Single construction (49668097) vs Single construction with no cache    
      (46318785)   (0.932566.2x faster)
Mean 25  Single construction (59471099) vs Loop construction                    
      (64330381)   (1.081708.2x faster)
Mean 25  Single construction (59471099) vs Loop construction with no cache      
      (60546587)   (1.018084.2x faster)
Mean 25  Single construction (59471099) vs Single construction with no cache    
      (54591773)   (0.917955.2x faster)
Mean 30  Single construction (68146512) vs Loop construction                    
      (74374546)   (1.091392.2x faster)
Mean 30  Single construction (68146512) vs Loop construction with no cache      
      (70451976)   (1.033831.2x faster)
Mean 30  Single construction (68146512) vs Single construction with no cache    
      (63940806)   (0.938284.2x faster)
Mean 35  Single construction (79878953) vs Loop construction                    
      (84966196)   (1.063687.2x faster)
Mean 35  Single construction (79878953) vs Loop construction with no cache      
      (87798333)   (1.099142.2x faster)
Mean 35  Single construction (79878953) vs Single construction with no cache    
      (89916831)   (1.125664.2x faster)
Mean 40  Single construction (44851418) vs Loop construction                    
      (150102896)   (3.346670.2x faster)
Mean 40  Single construction (44851418) vs Loop construction with no cache      
      (64463988)   (1.437279.2x faster)
Mean 40  Single construction (44851418) vs Single construction with no cache    
      (20547465)   (0.458123.2x faster)
Mean 45  Single construction (44565349) vs Loop construction                    
      (142062133)   (3.187726.2x faster)
Mean 45  Single construction (44565349) vs Loop construction with no cache      
      (56858335)   (1.275842.2x faster)
Mean 45  Single construction (44565349) vs Single construction with no cache    
      (20147275)   (0.452084.2x faster)
Mean 50  Single construction (43771462) vs Loop construction                    
      (146827129)   (3.354403.2x faster)
Mean 50  Single construction (43771462) vs Loop construction with no cache      
      (53001835)   (1.210877.2x faster)
Mean 50  Single construction (43771462) vs Single construction with no cache    
      (19761750)   (0.451476.2x faster)
Mean 55  Single construction (43445931) vs Loop construction                    
      (163112456)   (3.754378.2x faster)
Mean 55  Single construction (43445931) vs Loop construction with no cache      
      (52989752)   (1.219671.2x faster)
Mean 55  Single construction (43445931) vs Single construction with no cache    
      (19555998)   (0.450123.2x faster)
Mean 60  Single construction (43694078) vs Loop construction                    
      (166955616)   (3.821012.2x faster)
Mean 60  Single construction (43694078) vs Loop construction with no cache      
      (53799074)   (1.231267.2x faster)
Mean 60  Single construction (43694078) vs Single construction with no cache    
      (19645395)   (0.449612.2x faster)
Mean 65  Single construction (43766713) vs Loop construction                    
      (179079936)   (4.091693.2x faster)
Mean 65  Single construction (43766713) vs Loop construction with no cache      
      (55747299)   (1.273737.2x faster)
Mean 65  Single construction (43766713) vs Single construction with no cache    
      (20493758)   (0.468250.2x faster)
Mean 70  Single construction (43975538) vs Loop construction                    
      (189005650)   (4.297972.2x faster)
Mean 70  Single construction (43975538) vs Loop construction with no cache      
      (53124058)   (1.208037.2x faster)
Mean 70  Single construction (43975538) vs Single construction with no cache    
      (19395643)   (0.441055.2x faster)
Mean 75  Single construction (45913218) vs Loop construction                    
      (205133974)   (4.467863.2x faster)
Mean 75  Single construction (45913218) vs Loop construction with no cache      
      (55565851)   (1.210236.2x faster)
Mean 75  Single construction (45913218) vs Single construction with no cache    
      (19380347)   (0.422108.2x faster)
Mean 80  Single construction (43494472) vs Loop construction                    
      (204741747)   (4.707305.2x faster)
Mean 80  Single construction (43494472) vs Loop construction with no cache      
      (53357258)   (1.226760.2x faster)
Mean 80  Single construction (43494472) vs Single construction with no cache    
      (19207700)   (0.441612.2x faster)
{noformat}

> PoissonSampler single use speed improvements
> --------------------------------------------
>
>                 Key: RNG-50
>                 URL: https://issues.apache.org/jira/browse/RNG-50
>             Project: Commons RNG
>          Issue Type: Improvement
>    Affects Versions: 1.0
>            Reporter: Alex D Herbert
>            Priority: Minor
>         Attachments: PoissonSamplerTest.java
>
>
> The Sampler architecture of {{org.apache.commons.rng.sampling.distribution}} 
> is nicely written for fast sampling of small dataset sizes. The constructors 
> for the samplers do not check the input parameters are valid for the 
> respective distributions (in contrast to the old 
> {{org.apache.commons.math3.random.distribution}} classes). I assume this is a 
> design choice for speed. Thus most of the samplers can be used within a loop 
> to sample just one value with very little overhead.
> The {{PoissonSampler}} precomputes log factorial numbers upon construction if 
> the mean is above 40. This is done using the {{InternalUtils.FactorialLog}} 
> class. As of version 1.0 this internal class is currently only used in the 
> {{PoissonSampler}}.
> The cache size is limited to 2*PIVOT (where PIVOT=40). But it creates and 
> precomputes the cache every time a PoissonSampler is constructed if the mean 
> is above the PIVOT value.
> Why not create this once in a static block for the PoissonSampler?
> {code:java}
> /** {@code log(n!)}. */
> private static final FactorialLog factorialLog;
>      
> static 
> {
>     factorialLog = FactorialLog.create().withCache((int) (2 * 
> PoissonSampler.PIVOT));
> }
> {code}
> This will make the construction cost of a new {{PoissonSampler}} negligible. 
> If the table is computed dynamically as a static construction method then the 
> overhead will be in the first use. Thus the following call will be much 
> faster:
> {code:java}
> UniformRandomProvider rng = ...;
> int value = new PoissonSampler(rng, 50).sample();
> {code}
> I have tested this modification (see attached file) and the results are:
> {noformat}
> Mean 40  Single construction ( 7330792) vs Loop construction                  
>         (24334724)   (3.319522.2x faster)
> Mean 40  Single construction ( 7330792) vs Loop construction with static 
> FactorialLog ( 7990656)   (1.090013.2x faster)
> Mean 50  Single construction ( 6390303) vs Loop construction                  
>         (19389026)   (3.034132.2x faster)
> Mean 50  Single construction ( 6390303) vs Loop construction with static 
> FactorialLog ( 6146556)   (0.961857.2x faster)
> Mean 60  Single construction ( 6041165) vs Loop construction                  
>         (21337678)   (3.532047.2x faster)
> Mean 60  Single construction ( 6041165) vs Loop construction with static 
> FactorialLog ( 5329129)   (0.882136.2x faster)
> Mean 70  Single construction ( 6064003) vs Loop construction                  
>         (23963516)   (3.951765.2x faster)
> Mean 70  Single construction ( 6064003) vs Loop construction with static 
> FactorialLog ( 5306081)   (0.875013.2x faster)
> Mean 80  Single construction ( 6064772) vs Loop construction                  
>         (26381365)   (4.349935.2x faster)
> Mean 80  Single construction ( 6064772) vs Loop construction with static 
> FactorialLog ( 6341274)   (1.045591.2x faster)
> {noformat}
> Thus the speed improvements would be approximately 3-4 fold for single use 
> Poisson sampling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (RNG-50) PoissonSampler single use speed improvements

Reply via email to