[
https://issues.apache.org/jira/browse/STATISTICS-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Herbert updated STATISTICS-69:
-----------------------------------
Description:
A 2x2 contingency table [[a, b], [c, d]] is used to visualize N independent
observations of two binary variables (G or g and H or h):
{noformat}
G g
-------
H | a b | m
h | c d | n
----------
s r | N{noformat}
The probability distributions are classified into 3 cases:
# The row and column sums are fixed in advance. All table entries are
determined by a. This follows a hypergeometric distribution with parameters N,
m, s.
# The row sums are fixed, but the column sums are not. All table entries are
determined by a and c. The distribution is a joint binomial distribution with
probabilities p0 and p1:
a ~ B(m, p0); c ~ B(n, p1)
# Only the total N is fixed (the row and column sums are not). The table (a, b,
c, d) follows a multinomial distribution.
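The three cases can be sketched as probability functions. This is a minimal illustration in Python using only the standard library; the function names and argument order are assumptions made for this example and are not part of any existing API.

```python
from math import comb, factorial

def hypergeom_pmf(a, N, m, s):
    """Case 1: P(a) when row sum m, column sum s and total N are all fixed."""
    # a must lie in [max(0, s - (N - m)), min(m, s)]
    return comb(m, a) * comb(N - m, s - a) / comb(N, s)

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def joint_binom_pmf(a, c, m, n, p0, p1):
    """Case 2: P(a, c) when only the row sums m and n are fixed."""
    return binom_pmf(a, m, p0) * binom_pmf(c, n, p1)

def multinomial_pmf(a, b, c, d, pa, pb, pc, pd):
    """Case 3: P(a, b, c, d) when only the total N = a + b + c + d is fixed."""
    N = a + b + c + d
    coef = factorial(N) // (factorial(a) * factorial(b) * factorial(c) * factorial(d))
    return coef * pa**a * pb**b * pc**c * pd**d
```

In each case the probabilities sum to one over the tables permitted by the fixed totals, which is a useful sanity check.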
Case 1 is covered by Fisher's exact test (see [STATISTICS-64]). It does not
occur very often in practice because it requires both the column and row sums
to be fixed in advance. This is an exact conditioned test (as it conditions on
the row and column sums).
Case 2, where the row sums are fixed but the column sums are not, is more
common. For example, a clinical trial with two groups of fixed size (e.g.
medication or placebo); the outcome of cure or no cure for each patient is not
known in advance.
Case 3 is rare. For example flipping two coins N times and totalling the
heads/tails for each independently.
I propose adding an unconditioned exact test. Case 2 is the more common and
simpler to support. It involves generating a test statistic for each possible
table given the fixed row totals. The p-value is obtained from the subset of
possible tables whose test statistic is at least as extreme as that of the
observed table. Alternatively the subset is grown by incrementally adding
candidates based on which next sized subset has the smallest p-value. This is
the CSM (Convexity, Symmetry, Minimization) test of Barnard (1945). This is
computationally expensive and benefits from precomputed tables that rank the
order of tables for a given size (m, n). In either case the computation of the
p-value involves maximising the p-value over a nuisance parameter in the range
(0, 1).
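The statistic-based procedure can be sketched as follows. This is a minimal two-sided illustration, assuming the Z-pooled statistic and a plain uniform grid over the nuisance parameter; the function names and the grid size are assumptions for this example, and it is not the CSM test or the implementation used by R or SciPy.

```python
from math import comb, sqrt

def z_pooled(x1, x2, m, n):
    """Z statistic with pooled variance for x1 successes of m and x2 of n."""
    p = (x1 + x2) / (m + n)
    var = p * (1 - p) * (1 / m + 1 / n)
    if var == 0:
        return 0.0  # all successes or all failures: no observed difference
    return (x1 / m - x2 / n) / sqrt(var)

def unconditional_p_value(a, c, m, n, grid=100):
    """Maximise the tail probability over the common success probability pi."""
    t_obs = abs(z_pooled(a, c, m, n))
    # Tables (i, j) with fixed row sums whose statistic is at least as extreme.
    extreme = [(i, j) for i in range(m + 1) for j in range(n + 1)
               if abs(z_pooled(i, j, m, n)) >= t_obs - 1e-12]
    best = 0.0
    for k in range(1, grid):
        pi = k / grid  # nuisance parameter under H0: p0 == p1 == pi
        tail = sum(comb(m, i) * comb(n, j)
                   * pi**(i + j) * (1 - pi)**(m + n - i - j)
                   for i, j in extreme)
        best = max(best, tail)
    return best
```

A call such as `unconditional_p_value(7, 2, 10, 10)` returns the grid maximum only; the true supremum may lie between grid points, which is why a refinement step is usually added.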
Possible test statistics are Fisher's p-value for the table (known as
Boschloo's test (1970)), or a Z-pooled or Z-unpooled statistic.
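As a rough sketch, two of these statistics could be computed as below. The two-sided Fisher p-value here sums the probabilities of all tables no more likely than the observed one, which is one common convention and an assumption of this example; the function names are hypothetical. The Z-pooled statistic differs from the Z-unpooled one only in using a single pooled proportion for the variance.

```python
from math import comb, copysign, inf, sqrt

def fisher_p_value(a, c, m, n):
    """Two-sided Fisher p-value for a table with row sums m, n and first column (a, c)."""
    s, N = a + c, m + n
    lo, hi = max(0, s - n), min(m, s)

    def pmf(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(m, x) * comb(n, s - x) / comb(N, s)

    p_obs = pmf(a)
    # Small relative tolerance so that ties are counted despite rounding.
    total = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def z_unpooled(a, c, m, n):
    """Z statistic using a separate variance estimate for each row."""
    p0, p1 = a / m, c / n
    diff = p0 - p1
    var = p0 * (1 - p0) / m + p1 * (1 - p1) / n
    if var == 0:
        # Both proportions are 0 or 1: the statistic is degenerate.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / sqrt(var)
```

When the Fisher p-value is used as the statistic, smaller values are more extreme, so the ordering of tables is reversed relative to the absolute Z statistics.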
Implementation of the CSM test is computationally intense.
There is a reference implementation in R as the Exact package:
[https://cran.r-project.org/web/packages/Exact/Exact.pdf]
SciPy has implementations of Boschloo's test and the Z-pooled/unpooled test
(which they name Barnard's test):
[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boschloo_exact.html]
[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.barnard_exact.html]
Note that the search for the nuisance parameter involves a univariate function
with multiple local optima. The implementations in R and SciPy both use
multiple start points to find candidate locations for a search for a maximum.
This is done by using N uniform points in (0, 1) and then (optionally)
optimising the best candidate to find the maximum. Gradient-based methods
would require numerical differentiation, so the function is better suited to a
non-derivative method such as Brent optimisation for the univariate case.
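The grid-then-refine search can be sketched as below. Note that this example swaps in a golden-section search as a simple derivative-free stand-in for Brent optimisation, and the objective is a hypothetical tail-probability function chosen because it has two peaks; all names are assumptions for this illustration.

```python
from math import sqrt

def golden_section_max(f, lo, hi, tol=1e-8):
    """Derivative-free maximisation on [lo, hi]; assumes one peak in the bracket."""
    inv_phi = (sqrt(5) - 1) / 2
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:
            # The maximum lies in [x1, hi]; reuse x2 as the new lower probe.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
        else:
            # The maximum lies in [lo, x2]; reuse x1 as the new upper probe.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
    x = (lo + hi) / 2
    return x, f(x)

def maximise_nuisance(f, points=50):
    """Evaluate f on a uniform grid in (0, 1), then refine around the best point."""
    step = 1 / (points + 1)
    grid = [(k + 1) * step for k in range(points)]
    best = max(grid, key=f)
    x, fx = golden_section_max(f, max(best - step, 0.0), min(best + step, 1.0))
    # Keep the grid point if the refinement did not improve on it.
    return (x, fx) if fx >= f(best) else (best, f(best))

def tail_probability(pi):
    """Hypothetical objective: P(table is (1, 0) or (0, 9)) for m = 1, n = 9."""
    return pi * (1 - pi)**9 + (1 - pi) * pi**9

location, p_value = maximise_nuisance(tail_probability)
```

The grid step guards against refining the wrong peak; the local search then only has to handle a single bracket around the best candidate.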
See also:
[https://en.wikipedia.org/wiki/Boschloo%27s_test]
[https://en.wikipedia.org/wiki/Barnard%27s_test]
was:
A 2x2 contingency table [[a, b], [c, d]] is used to visualize N independent
observations of two binary variables (G or g and H or h):
{noformat}
G g
-------
H | a b | m
h | c d | n
----------
s r | N{noformat}
The probability distributions are classified into 3 cases:
# The row and column sums are fixed in advance. All table entries are
determined by a. This follows a hypergeometric distribution with parameters N,
m, s.
# The row sums are fixed, but the column sums are not. All table entries are
determined by a and c. The distribution is a join binomial distribution with
probabilities p0 and p1:
a ~ B(m, p0); b ~ B(n, p1)
# Only the total N is fixed (row and columns sums are not). The table (a, b,
c, d) is a multinomial distribution.
Case 1 is covered by using Fisher's exact test (see [STATISTICS 64]). It does
not occur in practice very often as the column and row sums are both fixed in
advance. This is an exact conditioned test (as it conditions on the row sums).
Case 2 is more common where the row sums are fixed but the columns are not. For
example a clinical trial with two groups of fixed size (e.g. medication or
placebo); the outcome of cure or no cure for each of the patients is unknown.
Case 3 is rare. For example flipping two coins N times and totalling the
heads/tails for each independently.
I propose adding a test that can handle an unconditioned exact test. Case 2 is
the more common and simpler to support. It involves generating a test statistic
for each possible table given the fixed totals. The p-value is obtained from a
subset of the possible test statistics that are more extreme that the observed
table. Alternatively the subset is maximised by incrementally adding candidates
based on which next sized subset has the smallest p-value. This is the CSM
(Convexity, Symmetry, Minimization) test of Barnard (1945). This is
computational expensive and benefits from precomputed tables which ranks the
order of tables for a given size (m,n). In either case the computation of the
p-value involves maximising the p-value given a nuisance parameter in the range
(0, 1).
Possible test statistics are Fisher's p-value for the table (known as
Boschloo's test (1970)), or using a Z-pooled or Z-Unpooled statistic.
Implementation of the CSM test is computationally intense.
There is a reference implementation in R as the Exact package:
[https://cran.r-project.org/web/packages/Exact/Exact.pdf]
SciPy has implementation of Boshloo's and the z-pooled/unpooled test (which
they name Barnard's test):
[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boschloo_exact.html]
[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.barnard_exact.html]
Note that the search for the nuisance parameter involves a univariate function
with multiple minima. The implementations in R and SciPy both use multiple
start points to find candidate locations for a search for a maxima. This is
done by using N uniform points in (0, 1) and then (optionally) optimising the
best candidate to find the maximum. The function requires numerical
differentiation and would be suitable for a non-derivative method such as Brent
optimisation for the univariate case.
See also:
[https://en.wikipedia.org/wiki/Boschloo%27s_test]
[https://en.wikipedia.org/wiki/Barnard%27s_test]
> Add an unconditioned exact test for 2x2 contingency tables
> ----------------------------------------------------------
>
> Key: STATISTICS-69
> URL: https://issues.apache.org/jira/browse/STATISTICS-69
> Project: Commons Statistics
> Issue Type: New Feature
> Components: inference
> Reporter: Alex Herbert
> Priority: Minor
> Fix For: 1.1
>