[ https://issues.apache.org/jira/browse/STATISTICS-70?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Herbert resolved STATISTICS-70.
------------------------------------
Fix Version/s: 1.1
Resolution: Implemented
Added in commit 584bf8966b999e542d389cbe7f8f76516d5dbacf
Note that this optimises the summation of the PMF used in the cumulative
probability sums. Further performance improvements could be made by caching
the PMF values. This would require an array of size:
{noformat}
N = population size
K = number of successes
n = sample size
lower = max(0, K - (N - n))
upper = min(n, K)
size = upper - lower + 1
~ min(n, K){noformat}
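As a rough illustration of the size calculation above, the bounds can be written as a small helper. The class and method names here are hypothetical and are not part of Commons Statistics.

```java
/** Hypothetical helper for illustration; not part of Commons Statistics. */
final class HypergeometricSupport {
    private HypergeometricSupport() {}

    /** Smallest x with non-zero probability: max(0, K - (N - n)). */
    static int lower(int populationSize, int successes, int sampleSize) {
        return Math.max(0, successes - (populationSize - sampleSize));
    }

    /** Largest x with non-zero probability: min(n, K). */
    static int upper(int successes, int sampleSize) {
        return Math.min(sampleSize, successes);
    }

    /** Number of PMF values a full cache would hold: upper - lower + 1. */
    static int cacheSize(int populationSize, int successes, int sampleSize) {
        return upper(successes, sampleSize)
            - lower(populationSize, successes, sampleSize) + 1;
    }
}
```

For example, N = 50, K = 5, n = 10 gives lower = 0 and upper = 5, so the cache would hold 6 values.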
The cache could be lazily evaluated but the array allocation would be fixed on
creation. Such functionality could be implemented using a method on the
instance:
{code:java}
// Proposed instance method (not in the current API):
public HypergeometricDistribution withCache();

// Example usage:
HypergeometricDistribution dist = HypergeometricDistribution.of(N, K, n)
                                                            .withCache();
{code}
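A minimal sketch of what a lazily evaluated cache might look like is given below. It is a standalone class with hypothetical names, not the proposed withCache() implementation, and it assumes a simple log-binomial PMF computation that is only suitable for illustration. The array is allocated once in the constructor and each slot is filled the first time it is requested.

```java
import java.util.Arrays;

/** Hypothetical sketch of a lazy PMF cache; not the Commons Statistics code. */
final class CachedHypergeometricPmf {
    private final int populationSize; // N
    private final int successes;      // K
    private final int sampleSize;     // n
    private final int lower;          // max(0, K - (N - n))
    private final int upper;          // min(n, K)
    /** One slot per support point; NaN marks a value not yet computed. */
    private final double[] cache;

    CachedHypergeometricPmf(int populationSize, int successes, int sampleSize) {
        this.populationSize = populationSize;
        this.successes = successes;
        this.sampleSize = sampleSize;
        this.lower = Math.max(0, successes - (populationSize - sampleSize));
        this.upper = Math.min(sampleSize, successes);
        // Allocation is fixed on creation; values are filled on demand.
        this.cache = new double[upper - lower + 1];
        Arrays.fill(cache, Double.NaN);
    }

    /** Returns P(X = x), computing and storing it on first use. */
    double probability(int x) {
        if (x < lower || x > upper) {
            return 0;
        }
        final int i = x - lower;
        double p = cache[i];
        if (Double.isNaN(p)) {
            p = Math.exp(logBinomial(successes, x)
                + logBinomial(populationSize - successes, sampleSize - x)
                - logBinomial(populationSize, sampleSize));
            cache[i] = p;
        }
        return p;
    }

    /** log of C(n, k) as a sum of logs; simple, not optimised. */
    private static double logBinomial(int n, int k) {
        double sum = 0;
        for (int i = 1; i <= k; i++) {
            sum += Math.log((double) (n - k + i) / i);
        }
        return sum;
    }
}
```

This sketch is not synchronised, so a shared instance would need additional care before use from multiple threads.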
> Improve the CDF of the Hypergeometric distribution
> --------------------------------------------------
>
> Key: STATISTICS-70
> URL: https://issues.apache.org/jira/browse/STATISTICS-70
> Project: Commons Statistics
> Issue Type: Improvement
> Components: distribution
> Affects Versions: 1.0
> Reporter: Alex Herbert
> Priority: Minor
> Fix For: 1.1
>
>
> The hypergeometric distribution computes the CDF and the survival function
> (SF) using a summation of the PMF. This can be improved by caching a midpoint
> and only summing a choice of the lower or upper section. The complement can
> be used to compute the function in the other domain, e.g. CDF = 1 - SF.
> Other functions can also exploit this summation:
> * The probability(x0, x1) function can be performed using a summation of the
> range (x0, x1]. Currently it uses the default implementation which is CDF(x1)
> - CDF(x0). This will duplicate part of the summation of the range (i.e. up to
> x0).
> * The inverse CDF and inverse SF use the default implementation of a
> bracketed bisection search of the CDF or SF. This can be updated to simply
> sum the PMF until the target CDF / SF is obtained. This effectively changes
> the function to a single call to the smaller of CDF or SF to find the target
> quantile.
> The midpoint could be the median (CDF ~ SF ~ 0.5) which requires computation,
> or the mode which is floor((n+1)(K+1)/(N+2)). From a look at example density
> functions the two values should be similar (see [Hypergeometric distribution
> (Wikipedia)|https://en.wikipedia.org/wiki/Hypergeometric_distribution]).
> However to ensure strict inversion the p-value would also be required for the
> midpoint so the inverse implementation can correctly switch the choice of
> which function to invert.
>
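The summation strategy in the quoted description could look roughly like the sketch below. This is one possible reading and not the code added in the commit: it uses the mode as the midpoint, sums whichever tail lies on the same side of the mode as x, and takes the complement for the other function. The class name, method names and the simple log-binomial PMF are assumptions made for illustration. The inverse CDF here sums from the lower bound only and does not switch to the SF.

```java
/** Hypothetical sketch of tail summation; names are illustrative only. */
final class HypergeometricSums {
    private final int populationSize; // N
    private final int successes;      // K
    private final int sampleSize;     // n
    private final int lower;          // max(0, K - (N - n))
    private final int upper;          // min(n, K)
    private final int mode;           // floor((n+1)(K+1)/(N+2)), the midpoint

    HypergeometricSums(int populationSize, int successes, int sampleSize) {
        this.populationSize = populationSize;
        this.successes = successes;
        this.sampleSize = sampleSize;
        this.lower = Math.max(0, successes - (populationSize - sampleSize));
        this.upper = Math.min(sampleSize, successes);
        final long m = (sampleSize + 1L) * (successes + 1L) / (populationSize + 2L);
        // Defensive clamp to the support.
        this.mode = (int) Math.min(upper, Math.max(lower, m));
    }

    /** P(X = x). */
    double pmf(int x) {
        if (x < lower || x > upper) {
            return 0;
        }
        return Math.exp(logBinomial(successes, x)
            + logBinomial(populationSize - successes, sampleSize - x)
            - logBinomial(populationSize, sampleSize));
    }

    /** P(X <= x): sum the lower tail up to the mode, else use 1 - upper tail. */
    double cdf(int x) {
        if (x < lower) {
            return 0;
        }
        if (x >= upper) {
            return 1;
        }
        return x <= mode ? sum(lower, x) : 1 - sum(x + 1, upper);
    }

    /** P(X > x): sum the upper tail from the mode, else use 1 - lower tail. */
    double sf(int x) {
        if (x < lower) {
            return 1;
        }
        if (x >= upper) {
            return 0;
        }
        return x >= mode ? sum(x + 1, upper) : 1 - sum(lower, x);
    }

    /** P(x0 < X <= x1) as a single summation over (x0, x1]. */
    double probability(int x0, int x1) {
        if (x1 <= x0) {
            return 0;
        }
        return sum(Math.max(x0 + 1, lower), Math.min(x1, upper));
    }

    /** Smallest x with CDF(x) >= p, found by summing the PMF from the lower bound. */
    int inverseCdf(double p) {
        double total = 0;
        for (int x = lower; x < upper; x++) {
            total += pmf(x);
            if (total >= p) {
                return x;
            }
        }
        return upper;
    }

    /** Sum of pmf over [from, to] inclusive; zero if the range is empty. */
    private double sum(int from, int to) {
        double total = 0;
        for (int x = from; x <= to; x++) {
            total += pmf(x);
        }
        return total;
    }

    /** log of C(n, k) as a sum of logs; simple, not optimised. */
    private static double logBinomial(int n, int k) {
        double s = 0;
        for (int i = 1; i <= k; i++) {
            s += Math.log((double) (n - k + i) / i);
        }
        return s;
    }
}
```

With N = 10, K = 3, n = 4 the mode is 1, so cdf(1) sums the two lowest PMF values while cdf(2) is computed as 1 minus the single upper-tail value.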