[jira] [Resolved] (STATISTICS-70) Improve the CDF of the Hypergeometric distribution

Alex Herbert (Jira) Fri, 17 Feb 2023 11:00:06 -0800


     [ 
https://issues.apache.org/jira/browse/STATISTICS-70?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alex Herbert resolved STATISTICS-70.
------------------------------------
    Fix Version/s: 1.1
       Resolution: Implemented

Added in commit 584bf8966b999e542d389cbe7f8f76516d5dbacf

Note that this optimises the summation of the PMF used in the cumulative 
probability sums. However, further performance improvements could be made by 
caching the PMF values. This would require an array of size:
{noformat}
N = population size
K = number of successes
n = sample size

lower = max(0, K - (N - n))
upper = min(n, K)

size = upper - lower + 1
     ~ min(n, K){noformat}
The cache could be lazily evaluated but the array allocation would be fixed on 
creation. Such functionality could be implemented using a method on the 
instance:
{code:java}
public HypergeometricDistribution withCache();

HypergeometricDistribution dist = HypergeometricDistribution.of(N, K, n)
                                                            .withCache(); {code}
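
A minimal sketch of such a lazily evaluated cache is given below. It is not the Commons Statistics implementation: the class name CachedHypergeometricPmf, its methods, and the idea of passing the PMF in as a function are assumptions made purely for illustration. The array is allocated once on creation with the size given above and each entry is filled on first use.

```java
import java.util.Arrays;
import java.util.function.IntToDoubleFunction;

/**
 * Hypothetical lazily filled PMF cache (illustration only; this class is
 * not part of Commons Statistics). Not thread-safe.
 */
final class CachedHypergeometricPmf {
    private final int lower;
    private final int upper;
    private final double[] cache;
    private final IntToDoubleFunction pmf;

    /**
     * @param populationSize N
     * @param numberOfSuccesses K
     * @param sampleSize n
     * @param pmf function computing the uncached PMF (assumed to be supplied by the caller)
     */
    CachedHypergeometricPmf(int populationSize, int numberOfSuccesses,
                            int sampleSize, IntToDoubleFunction pmf) {
        // Support of the distribution is [lower, upper]
        lower = Math.max(0, numberOfSuccesses - (populationSize - sampleSize));
        upper = Math.min(sampleSize, numberOfSuccesses);
        // Allocation is fixed on creation: size = upper - lower + 1
        cache = new double[upper - lower + 1];
        // NaN marks an entry that has not been computed yet
        Arrays.fill(cache, Double.NaN);
        this.pmf = pmf;
    }

    /** Number of cached entries, i.e. upper - lower + 1. */
    int size() {
        return cache.length;
    }

    /** PMF at x, computed on first request and served from the array afterwards. */
    double probability(int x) {
        if (x < lower || x > upper) {
            return 0;
        }
        final int i = x - lower;
        double p = cache[i];
        if (Double.isNaN(p)) {
            p = pmf.applyAsDouble(x);
            cache[i] = p;
        }
        return p;
    }
}
```

For N = 10, K = 4, n = 5 the support is [0, 4] and the array has 5 entries; for N = 10, K = 8, n = 5 the support is [3, 5] and the array has 3 entries.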

> Improve the CDF of the Hypergeometric distribution
> --------------------------------------------------
>
>                 Key: STATISTICS-70
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-70
>             Project: Commons Statistics
>          Issue Type: Improvement
>          Components: distribution
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Minor
>             Fix For: 1.1
>
>
> The hypergeometric distribution computes the CDF and the survival function 
> (SF) using a summation of the PMF. This can be improved by caching a midpoint 
> and summing only the lower or the upper section. The complement can 
> be used to compute the function in the other domain, e.g. CDF = 1 - SF.
> Other functions can also exploit this summation:
>  * The probability(x0, x1) function can be performed using a summation of the 
> range (x0, x1]. Currently it uses the default implementation which is CDF(x1) 
> - CDF(x0). This will duplicate part of the summation of the range (i.e. up to 
> x0).
>  * The inverse CDF and inverse SF use the default implementation of a 
> bracketed bisection search of the CDF or SF. This can be updated to simply 
> sum the PMF until the target CDF / SF is obtained. This effectively changes 
> the function to a single call to the smaller of CDF or SF to find the target 
> quantile.
> The midpoint could be the median (CDF ~ SF ~ 0.5) which requires computation, 
> or the mode which is floor((n+1)(K+1)/(N+2)). From a look at example density 
> functions the two values should be similar (see [Hypergeometric distribution 
> (Wikipedia)|https://en.wikipedia.org/wiki/Hypergeometric_distribution]). 
> However, to ensure strict inversion the p-value would also be required for the 
> midpoint so the inverse implementation can correctly switch the choice of 
> which function to invert.
>  
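
The summation scheme described in the issue can be sketched as follows. This is an illustration under stated assumptions, not the code added in the commit: the class HypergeometricSums and all of its methods are hypothetical, the PMF is computed naively from log binomial coefficients, arguments are not validated, and the inverse CDF always sums from the lower bound rather than switching sides at a cached midpoint p-value.

```java
/** Hypothetical sketch of mode-split summation (illustration only). */
final class HypergeometricSums {
    private final int popSize;    // N
    private final int successes;  // K
    private final int sampleSize; // n
    private final int lower;
    private final int upper;
    private final int mode;

    HypergeometricSums(int popSize, int successes, int sampleSize) {
        this.popSize = popSize;
        this.successes = successes;
        this.sampleSize = sampleSize;
        lower = Math.max(0, successes - (popSize - sampleSize));
        upper = Math.min(sampleSize, successes);
        // mode = floor((n+1)(K+1)/(N+2)); long arithmetic avoids int overflow
        mode = (int) (((long) sampleSize + 1) * ((long) successes + 1) / ((long) popSize + 2));
    }

    /** log of the binomial coefficient C(n, k); naive O(k) sum for illustration. */
    private static double logBinomial(int n, int k) {
        double s = 0;
        for (int i = 1; i <= k; i++) {
            s += Math.log((double) (n - k + i) / i);
        }
        return s;
    }

    double pmf(int x) {
        if (x < lower || x > upper) {
            return 0;
        }
        return Math.exp(logBinomial(successes, x)
            + logBinomial(popSize - successes, sampleSize - x)
            - logBinomial(popSize, sampleSize));
    }

    /** Sum of the PMF over the closed range [from, to]; zero if the range is empty. */
    private double sum(int from, int to) {
        double s = 0;
        for (int x = from; x <= to; x++) {
            s += pmf(x);
        }
        return s;
    }

    /** P(X <= x): sum the lower section below the mode, else use 1 - SF. */
    double cdf(int x) {
        if (x < lower) {
            return 0;
        }
        if (x >= upper) {
            return 1;
        }
        return x < mode ? sum(lower, x) : 1 - sum(x + 1, upper);
    }

    /** P(X > x): sum the upper section at or above the mode, else use 1 - CDF. */
    double sf(int x) {
        if (x < lower) {
            return 1;
        }
        if (x >= upper) {
            return 0;
        }
        return x < mode ? 1 - sum(lower, x) : sum(x + 1, upper);
    }

    /** P(x0 < X <= x1) as a single summation over (x0, x1]. */
    double probability(int x0, int x1) {
        if (x0 > x1) {
            throw new IllegalArgumentException("x0 > x1");
        }
        if (x0 == x1) {
            return 0;
        }
        return sum(Math.max(x0 + 1, lower), Math.min(x1, upper));
    }

    /** Smallest x with CDF(x) >= p, found by accumulating the PMF from the lower bound. */
    int inverseCdf(double p) {
        double c = 0;
        for (int x = lower; x < upper; x++) {
            c += pmf(x);
            if (c >= p) {
                return x;
            }
        }
        return upper;
    }
}
```

For N = 10, K = 4, n = 5 the PMF over x = 0..4 is 6/252, 60/252, 120/252, 60/252, 6/252 and the mode is 2. Here cdf(1) sums the lower section to give 66/252, while cdf(2) is computed as 1 - 66/252 from the upper section, and probability(1, 3) sums only x = 2 and x = 3.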



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
