[
https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514257#comment-14514257
]
Otmar Ertl edited comment on MATH-1220 at 4/27/15 3:10 PM:
-----------------------------------------------------------
Do you mean using a pre-calculated CDF table and sampling using binary search?
This would have O(N) initialization costs, O(N) space costs, and sampling would
require O(log(N)) time. All that is constant for the method I have proposed. If
initialization and memory costs are not an issue, sampling using such a
pre-calculated table for the CDF is probably faster for smaller N.
The proposed rejection sampling algorithm divides the support into two parts.
The head which currently only consists of the first point (equal to 1) and the
tail which consists of all remaining points. Rejection sampling is only used to
sample points from the tail. Instead, it would be possible to take the first X
(<=N) points as the head and to use such a pre-calculated CDF table in
combination with binary serach for sampling points from the head. For the tail
still the rejection method can be applied. When developing that algorithm I had
that in mind, but finally decided to set X = 1 to minimize the initialization
costs. The optimal value for X is likely to be use case dependent and difficult
to determine. I do not think that a pure CDF table based approach (X=N) is
feasible for all N, because initialization and memory costs could get very
large.
was (Author: otmar ertl):
Do you mean using a pre-calculated CDF tanle and sampling using binary search?
This would have O(N) initialization costs, O(N) space costs, and sampling would
require O(log(N)) time. All that is constant for the method I have proposed. If
initialization and memory costs are not an issue, sampling using such a
pre-calculated table for the CDF is probably faster for smaller N.
The proposed rejection sampling algorithm divides the support into two parts.
The head which currently only consists of the first point (equal to 1) and the
tail which consists of all remaining points. Rejection sampling is only used to
sample points from the tail. Instead, it would be possible to take the first X
(<=N) points as the head and to use such a pre-calculated CDF table in
combination with binary serach for sampling points from the head. For the tail
still the rejection method can be applied. When developing that algorithm I had
that in mind, but finally decided to set X = 1 to minimize the initialization
costs. The optimal value for X is likely to be use case dependent and difficult
to determine. I do not think that a pure CDF table based approach (X=N) is
feasible for all N, because initialization and memory costs could get very
large.
> More efficient sample() method for ZipfDistribution
> ---------------------------------------------------
>
> Key: MATH-1220
> URL: https://issues.apache.org/jira/browse/MATH-1220
> Project: Commons Math
> Issue Type: Improvement
> Reporter: Otmar Ertl
> Attachments: patch_v1
>
>
> Currently, sampling from a ZipfDistribution is very inefficient. Random
> values are generated by inverting the CDF. However, the current
> implementation uses O(N) power function evaluations to calculate the CDF for
> some point. (Here N is the number of points of the Zipf distribution.) I
> propose to use rejection sampling instead, which allows the generation of a
> single random value in constant time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)