[jira] [Comment Edited] (MATH-1220) More efficient sample() method for ZipfDistribution

Thomas Neidhart (JIRA) Wed, 29 Apr 2015 22:38:53 -0700

    [ 
https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520897#comment-14520897
 ]


Thomas Neidhart edited comment on MATH-1220 at 4/30/15 5:37 AM:
----------------------------------------------------------------

{quote}
Furthermore, I would allow the exponent to be non-negative. Currently, it is 
restricted to positive values.
{quote}

I am not sure if we should do this. Normally in literature and implementations, 
the exponent is even further restricted (exp > 1), and if you look at the pmf 
and cdf functions for exponents < 1 you can see that the resulting distribution 
is quite distorted. I guess the typical implementation uses the Riemann zeta 
function which does not converge for exponents < 1.

btw. there is something in your implementation that I do not understand:

{code}
     * An instrumental distribution g(k) is used to generate random values by 
rejection sampling.
     * g(k) is defined as g(1):= 1 and g(k) := I(-s,k-1/2,k+1/2) for k larger 
than 1,
     * where s denotes the exponent of the Zipf distribution and I(r,a,b) is 
the integral of x^r for x from a to b.
     * Since 1^x^s is a convex function, Jensens's inequality gives
     * I(-s,k-1/2,k+1/2) >= 1/k^s for all positive k and non-negative s.
     * In order to limit the rejection rate for large exponents s,
     * the instrumental distribution weight is differently defined for value 1.
     */
    @Override
    public int sample() {
        if (Double.isNaN(instrumentalDistributionTailWeight)) {
            instrumentalDistributionTailWeight = 
integratePowerFunction(-exponent, 1.5, numberOfElements+0.5);
        }
{code}

talks about integrating the power function in the range [k-0.5, k+0.5] but in 
fact uses 1.5 and k+0.5 as bounds. Where does the 1.5 come from?


was (Author: tn):
{comment}
Furthermore, I would allow the exponent to be non-negative. Currently, it is 
restricted to positive values.
{comment}

I am not sure if we should do this. Normally in literature and implementations, 
the exponent is even further restricted (exp > 1), and if you look at the pmf 
and cdf functions for exponents < 1 you can see that the resulting distribution 
is quite distorted. I guess the typical implementation uses the Riemann zeta 
function which does not converge for exponents < 1.

btw. there is something in your implementation that I do not understand:

{code}
     * An instrumental distribution g(k) is used to generate random values by 
rejection sampling.
     * g(k) is defined as g(1):= 1 and g(k) := I(-s,k-1/2,k+1/2) for k larger 
than 1,
     * where s denotes the exponent of the Zipf distribution and I(r,a,b) is 
the integral of x^r for x from a to b.
     * Since 1^x^s is a convex function, Jensens's inequality gives
     * I(-s,k-1/2,k+1/2) >= 1/k^s for all positive k and non-negative s.
     * In order to limit the rejection rate for large exponents s,
     * the instrumental distribution weight is differently defined for value 1.
     */
    @Override
    public int sample() {
        if (Double.isNaN(instrumentalDistributionTailWeight)) {
            instrumentalDistributionTailWeight = 
integratePowerFunction(-exponent, 1.5, numberOfElements+0.5);
        }
{code}

talks about integrating the power function in the range [k-0.5, k+0.5] but in 
fact uses 1.5 and k+0.5 as bounds. Where does the 1.5 come from?

> More efficient sample() method for ZipfDistribution
> ---------------------------------------------------
>
>                 Key: MATH-1220
>                 URL: https://issues.apache.org/jira/browse/MATH-1220
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Otmar Ertl
>         Attachments: patch_v1
>
>
> Currently, sampling from a ZipfDistribution is very inefficient. Random 
> values are generated by inverting the CDF. However, the current 
> implementation uses O(N) power function evaluations to calculate the CDF for 
> some point. (Here N is the number of points of the Zipf distribution.) I 
> propose to use rejection sampling instead, which allows the generation of a 
> single random value in constant time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (MATH-1220) More efficient sample() method for ZipfDistribution

Reply via email to