[jira] [Comment Edited] (MATH-1220) More efficient sample() method for ZipfDistribution

Otmar Ertl (JIRA) Mon, 27 Apr 2015 08:12:19 -0700

    [ 
https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514257#comment-14514257
 ]


Otmar Ertl edited comment on MATH-1220 at 4/27/15 3:10 PM:
-----------------------------------------------------------

Do you mean using a pre-calculated CDF table and sampling using binary search? 
This would have O(N) initialization costs, O(N) space costs, and sampling would 
require O(log(N)) time. All that is constant for the method I have proposed. If 
initialization and memory costs are not an issue, sampling using such a 
pre-calculated table for the CDF is probably faster for smaller N. 

The proposed rejection sampling algorithm divides the support into two parts. 
The head which currently only consists of the first point (equal to 1) and the 
tail which consists of all remaining points. Rejection sampling is only used to 
sample points from the tail. Instead, it would be possible to take the first X 
(<=N) points as the head and to use such a pre-calculated CDF table in 
combination with binary serach for sampling points from the head. For the tail 
still the rejection method can be applied. When developing that algorithm I had 
that in mind, but finally decided to set X = 1 to minimize the initialization 
costs. The optimal value for X is likely to be use case dependent and difficult 
to determine. I do not think that a pure CDF table based approach (X=N) is 
feasible for all N, because initialization and memory costs could get very 
large.


was (Author: otmar ertl):
Do you mean using a pre-calculated CDF tanle and sampling using binary search? 
This would have O(N) initialization costs, O(N) space costs, and sampling would 
require O(log(N)) time. All that is constant for the method I have proposed. If 
initialization and memory costs are not an issue, sampling using such a 
pre-calculated table for the CDF is probably faster for smaller N. 

The proposed rejection sampling algorithm divides the support into two parts. 
The head which currently only consists of the first point (equal to 1) and the 
tail which consists of all remaining points. Rejection sampling is only used to 
sample points from the tail. Instead, it would be possible to take the first X 
(<=N) points as the head and to use such a pre-calculated CDF table in 
combination with binary serach for sampling points from the head. For the tail 
still the rejection method can be applied. When developing that algorithm I had 
that in mind, but finally decided to set X = 1 to minimize the initialization 
costs. The optimal value for X is likely to be use case dependent and difficult 
to determine. I do not think that a pure CDF table based approach (X=N) is 
feasible for all N, because initialization and memory costs could get very 
large.

> More efficient sample() method for ZipfDistribution
> ---------------------------------------------------
>
>                 Key: MATH-1220
>                 URL: https://issues.apache.org/jira/browse/MATH-1220
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Otmar Ertl
>         Attachments: patch_v1
>
>
> Currently, sampling from a ZipfDistribution is very inefficient. Random 
> values are generated by inverting the CDF. However, the current 
> implementation uses O(N) power function evaluations to calculate the CDF for 
> some point. (Here N is the number of points of the Zipf distribution.) I 
> propose to use rejection sampling instead, which allows the generation of a 
> single random value in constant time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (MATH-1220) More efficient sample() method for ZipfDistribution

Reply via email to