[jira] [Comment Edited] (RNG-50) PoissonSampler single use speed improvements

Alex D Herbert (JIRA) Wed, 01 Aug 2018 07:38:26 -0700


    [ 
https://issues.apache.org/jira/browse/RNG-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565413#comment-16565413
 ]


Alex D Herbert edited comment on RNG-50 at 8/1/18 2:38 PM:
-----------------------------------------------------------

Sorry about the link. I've updated it.
{quote}If so, LargeMeanPoissonSamplerCache should be thread-safe
{quote}
I don't think this pattern requires thread safety.

The lazy initialisation is such that all that is shared across all the threads 
is the fixed length array of cached states. If the state is null then the state 
is computed and put into the array. The method always keeps a reference to the 
computed state so it has a guaranteed return value.

If multiple threads each try and access the same null state then they will all 
compute it. Then each thread will put it into the array, replacing a value that 
may have been computed by another thread. However this will just be a 
duplication with the same underlying values.

If the write to the array is done via the local memory cache then no other 
threads running using different local cache will see it. However it will 
eventually be synchronised back to main memory and then out to any other CPU 
cache memory.

So you may get some duplication of computation across threads. But avoid any 
synchronisation issues.

The alternative is to go for synchronisation when the state is null. This 
should be done using a synchronisation on the level of the index that is 
missing. So allowing other threads to synchronise and update a different 
missing value. This could use 
{{java.util.concurrent.atomic.AtomicReferenceArray<E>}}.
{code:java}
package org.apache.commons.rng.sampling.distribution;

import java.util.concurrent.atomic.AtomicReferenceArray;

import org.apache.commons.rng.sampling.distribution.InternalUtils.FactorialLog;

/**
 * Compute initialisation state for the LargeMeanPoissonSampler caching values
 * in a set range.
 */
public class LargeMeanPoissonSamplerCache {

    /** Class to compute {@code log(n!)}. This has no cached values. */
    private static final InternalUtils.FactorialLog NO_CACHE_FACTORIAL_LOG;

    static {
        NO_CACHE_FACTORIAL_LOG = FactorialLog.create();
    }

    /** The minimum N covered by the cache. */
    private final int minN;
    /** The maximum N covered by the cache. */
    private final int maxN;
    /** The cache of states between {@link minN} and {@link maxN}. */
    private final AtomicReferenceArray<double[]> values;

    /**
     * @param minN The minimum N covered by the cache.
     * @param maxN The maximum N covered by the cache.
     * @throws IllegalArgumentException if {@code mean <= 0}.
     */
    public LargeMeanPoissonSamplerCache(int minN, int maxN) {
        if (minN < 0) {
            throw new IllegalArgumentException(
                    "MinN: " + minN + " <= " + 0);
        }
        if (maxN <= minN) {
            throw new IllegalArgumentException(
                    "MaxN: " + maxN + " <= " + minN);
        }
        this.minN = minN;
        this.maxN = maxN;
        values = new AtomicReferenceArray<double[]>(maxN - minN + 1);
    }

    /**
     * Get the initialisation state of the LargeMeanPoissonSampler.
     *
     * @param lambda The integer value of the Poisson mean
     *               ({@code Math.floor(mean)}).
     * @return the initialisation state
     */
    double[] getState(int lambda) {
        if (lambda < minN || lambda > maxN)
            return computeState(lambda);
        final int index = lambda - minN;
        double[] value = values.get(index);
        if (value == null) {
            // Compute and store for reuse
            value = computeState(lambda);
            values.lazySet(index, value);
            // or
            values.compareAndSet(index, null, value);
        }
        return value;
    }

    /**
     * compute the initialisation state of the LargeMeanPoissonSampler.
     *
     * @param lambda The integer value of the Poisson mean
     *               ({@code Math.floor(mean)}).
     * @return the initialisation state
     */
    private final static double[] computeState(int lambda) {
        final double logLambda = Math.log(lambda);
        final double logLambdaFactorial = 
                NO_CACHE_FACTORIAL_LOG.value(lambda);
        final double delta = Math.sqrt(lambda * 
                Math.log(32 * lambda / Math.PI + 1));
        final double halfDelta = delta / 2;
        final double twolpd = 2 * lambda + delta;
        final double c1 = 1 / (8 * lambda);
        final double a1 = Math.sqrt(Math.PI * twolpd) * Math.exp(c1);
        final double a2 = (twolpd / delta) * 
                Math.exp(-delta * (1 + delta) / twolpd);
        final double aSum = a1 + a2 + 1;
        final double p1 = a1 / aSum;
        final double p2 = a2 / aSum;
        return new double[] { logLambda, logLambdaFactorial, delta, 
                halfDelta, twolpd, p1, p2, c1 };
    }
}
{code}
I've never bothered to do this as the suggested pattern (leave the JVM to sort 
out main memory synchronisation) has always worked for me. I've used this on 
machines running 16-cores and obtained the same results as a single core.

I could have an investigation by adding an extension to the benchmark I 
currently have.


was (Author: alexherbert):
Sorry about the link. I've updated it.
{quote}If so, LargeMeanPoissonSamplerCache should be thread-safe
{quote}
I don't think this pattern requires thread safety.

The lazy initialisation is such that all that is shared across all the threads 
is the fixed length array of cached states. If the state is null then the state 
is computed and put into the array. The method always keeps a reference to the 
computed state so it has a guaranteed return value.

If multiple threads each try and access the same null state then they will all 
compute it. Then each thread will put it into the array, replacing a value that 
may have been computed by another thread. However this will just be a 
duplication with the same underlying values.

If the write to the array is done via the local memory cache then no other 
threads running using different local cache will see it. However it will 
eventually be synchronised back to main memory and then out to any other CPU 
cache memory.

So you may get some duplication of computation across threads. But avoid any 
synchronisation issues.

The alternative is to go for synchronisation when the state is null. This 
should be done using a synchronisation on the level of the index that is 
missing. So allowing other threads to synchronise and update a different 
missing value. This could use 
{{java.util.concurrent.atomic.AtomicReferenceArray<E>}}.
{code:java}
package org.apache.commons.rng.sampling.distribution;

import java.util.concurrent.atomic.AtomicReferenceArray;

import org.apache.commons.rng.sampling.distribution.InternalUtils.FactorialLog;

/**
 * Compute initialisation state for the LargeMeanPoissonSampler caching values
 * in a set range.
 */
public class LargeMeanPoissonSamplerCache {

    /** Class to compute {@code log(n!)}. This has no cached values. */
    private static final InternalUtils.FactorialLog NO_CACHE_FACTORIAL_LOG;

    static {
        NO_CACHE_FACTORIAL_LOG = FactorialLog.create();
    }

    /** The minimum N covered by the cache. */
    private final int minN;
    /** The maximum N covered by the cache. */
    private final int maxN;
    /** The cache of initialisation states between {@link minN} and {@link 
maxN}. */
    private final AtomicReferenceArray<double[]> values;

    /**
     * @param minN The minimum N covered by the cache.
     * @param maxN The maximum N covered by the cache.
     * @throws IllegalArgumentException if {@code mean <= 0}.
     */
    public LargeMeanPoissonSamplerCache(int minN, int maxN) {
        if (minN < 0) {
            throw new IllegalArgumentException("MinN: " + minN + " <= " + 0);
        }
        if (maxN <= minN) {
            throw new IllegalArgumentException("MaxN: " + maxN + " <= " + minN);
        }
        this.minN = minN;
        this.maxN = maxN;
        values = new AtomicReferenceArray<double[]>(maxN - minN + 1);
    }

    /**
     * Get the initialisation state of the LargeMeanPoissonSampler.
     *
     * @param lambda The integer value of the Poisson mean ({@code 
Math.floor(mean)}).
     * @return the initialisation state
     */
    double[] getState(int lambda) {
        if (lambda < minN || lambda > maxN)
            return computeState(lambda);
        final int index = lambda - minN;
        double[] value = values.get(index);
        if (value == null) {
            // Compute and store for reuse
            value = computeState(lambda);
            values.compareAndSet(index, null, value);
            // or values.lazySet(index, value);
        }
        return value;
    }
    
    /**
     * compute the initialisation state of the LargeMeanPoissonSampler.
     *
     * @param lambda The integer value of the Poisson mean ({@code 
Math.floor(mean)}).
     * @return the initialisation state
     */
    private final static double[] computeState(int lambda) {
        final double logLambda = Math.log(lambda);
        final double logLambdaFactorial = NO_CACHE_FACTORIAL_LOG.value(lambda);
        final double delta = Math.sqrt(lambda * Math.log(32 * lambda / Math.PI 
+ 1));
        final double halfDelta = delta / 2;
        final double twolpd = 2 * lambda + delta;
        final double c1 = 1 / (8 * lambda);
        final double a1 = Math.sqrt(Math.PI * twolpd) * Math.exp(c1);
        final double a2 = (twolpd / delta) * Math.exp(-delta * (1 + delta) / 
twolpd);
        final double aSum = a1 + a2 + 1;
        final double p1 = a1 / aSum;
        final double p2 = a2 / aSum;
        return new double[] { logLambda, logLambdaFactorial, delta, 
                halfDelta, twolpd, p1, p2, c1 };
    }
}
{code}
I've never bothered to do this as the suggested pattern (leave the JVM to sort 
out main memory synchronisation) has always worked for me. I've used this on 
machines running 16-cores and obtained the same results as a single core.

I could have an investigation by adding an extension to the benchmark I 
currently have.

> PoissonSampler single use speed improvements
> --------------------------------------------
>
>                 Key: RNG-50
>                 URL: https://issues.apache.org/jira/browse/RNG-50
>             Project: Commons RNG
>          Issue Type: Improvement
>    Affects Versions: 1.0
>            Reporter: Alex D Herbert
>            Priority: Minor
>         Attachments: PoissonSamplerTest.java, jmh-result.csv
>
>
> The Sampler architecture of {{org.apache.commons.rng.sampling.distribution}} 
> is nicely written for fast sampling of small dataset sizes. The constructors 
> for the samplers do not check the input parameters are valid for the 
> respective distributions (in contrast to the old 
> {{org.apache.commons.math3.random.distribution}} classes). I assume this is a 
> design choice for speed. Thus most of the samplers can be used within a loop 
> to sample just one value with very little overhead.
> The {{PoissonSampler}} precomputes log factorial numbers upon construction if 
> the mean is above 40. This is done using the {{InternalUtils.FactorialLog}} 
> class. As of version 1.0 this internal class is currently only used in the 
> {{PoissonSampler}}.
> The cache size is limited to 2*PIVOT (where PIVOT=40). But it creates and 
> precomputes the cache every time a PoissonSampler is constructed if the mean 
> is above the PIVOT value.
> Why not create this once in a static block for the PoissonSampler?
> {code:java}
> /** {@code log(n!)}. */
> private static final FactorialLog factorialLog;
>      
> static 
> {
>     factorialLog = FactorialLog.create().withCache((int) (2 * 
> PoissonSampler.PIVOT));
> }
> {code}
> This will make the construction cost of a new {{PoissonSampler}} negligible. 
> If the table is computed dynamically as a static construction method then the 
> overhead will be in the first use. Thus the following call will be much 
> faster:
> {code:java}
> UniformRandomProvider rng = ...;
> int value = new PoissonSampler(rng, 50).sample();
> {code}
> I have tested this modification (see attached file) and the results are:
> {noformat}
> Mean 40  Single construction ( 7330792) vs Loop construction                  
>         (24334724)   (3.319522.2x faster)
> Mean 40  Single construction ( 7330792) vs Loop construction with static 
> FactorialLog ( 7990656)   (1.090013.2x faster)
> Mean 50  Single construction ( 6390303) vs Loop construction                  
>         (19389026)   (3.034132.2x faster)
> Mean 50  Single construction ( 6390303) vs Loop construction with static 
> FactorialLog ( 6146556)   (0.961857.2x faster)
> Mean 60  Single construction ( 6041165) vs Loop construction                  
>         (21337678)   (3.532047.2x faster)
> Mean 60  Single construction ( 6041165) vs Loop construction with static 
> FactorialLog ( 5329129)   (0.882136.2x faster)
> Mean 70  Single construction ( 6064003) vs Loop construction                  
>         (23963516)   (3.951765.2x faster)
> Mean 70  Single construction ( 6064003) vs Loop construction with static 
> FactorialLog ( 5306081)   (0.875013.2x faster)
> Mean 80  Single construction ( 6064772) vs Loop construction                  
>         (26381365)   (4.349935.2x faster)
> Mean 80  Single construction ( 6064772) vs Loop construction with static 
> FactorialLog ( 6341274)   (1.045591.2x faster)
> {noformat}
> Thus the speed improvements would be approximately 3-4 fold for single use 
> Poisson sampling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (RNG-50) PoissonSampler single use speed improvements

Reply via email to