[jira] [Updated] (STATISTICS-69) Add an unconditioned exact test for 2x2 contingency tables

Alex Herbert (Jira) Tue, 21 Feb 2023 04:24:06 -0800


     [ 
https://issues.apache.org/jira/browse/STATISTICS-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alex Herbert updated STATISTICS-69:
-----------------------------------
    Description: 
A 2x2 contingency table [[a, b], [c, d]] is used to visualize N independent 
observations of two binary variables (G or g and H or h):

 
{noformat}
    G g
  -------
H | a b | m
h | c d | n
  ----------
    s r | N{noformat}
The probability distributions are classified into 3 cases:
 # The row and column sums are fixed in advance. All table entries are 
determined by a. This follows a hypergeometric distribution with parameters N, 
m, s.
 # The row sums are fixed, but the column sums are not. All table entries are 
determined by a and c. The distribution is a joint binomial distribution with 
probabilities p0 and p1:
a ~ B(m, p0); c ~ B(n, p1)
 # Only the total N is fixed (row and column sums are not). The table (a, b,
c, d) follows a multinomial distribution.
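
The three cases can be written down directly as probability functions. The
following is a minimal Python sketch using only the standard library; the
function names are illustrative and are not taken from Commons Statistics, the
R Exact package, or SciPy.

```python
from math import comb, factorial

def hypergeom_pmf(a, N, m, s):
    """Case 1: all margins fixed; a is hypergeometric with parameters N, m, s."""
    return comb(m, a) * comb(N - m, s - a) / comb(N, s)

def binom_pmf(k, size, p):
    """Binomial probability of k successes in size trials."""
    return comb(size, k) * p**k * (1 - p) ** (size - k)

def joint_binom_pmf(a, c, m, n, p0, p1):
    """Case 2: row sums m and n fixed; a ~ B(m, p0) and c ~ B(n, p1)."""
    return binom_pmf(a, m, p0) * binom_pmf(c, n, p1)

def multinomial_pmf(a, b, c, d, probs):
    """Case 3: only N = a + b + c + d is fixed; probs are the 4 cell probabilities."""
    coef = factorial(a + b + c + d) // (
        factorial(a) * factorial(b) * factorial(c) * factorial(d))
    pa, pb, pc, pd = probs
    return coef * pa**a * pb**b * pc**c * pd**d

# Example: each distribution sums to 1 over its support.
N, m, s = 10, 4, 5
n = N - m
support = range(max(0, s - n), min(m, s) + 1)  # valid values of a in case 1
total1 = sum(hypergeom_pmf(a, N, m, s) for a in support)
total2 = sum(joint_binom_pmf(a, c, m, n, 0.3, 0.6)
             for a in range(m + 1) for c in range(n + 1))
print(total1, total2)
```

In case 1 a single value (a) indexes the table, in case 2 the pair (a, c) does,
and in case 3 any three of the four cells do.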

Case 1 is covered by using Fisher's exact test (see [STATISTICS 64]). It does
not occur in practice very often as it requires both the column and row sums
to be fixed in advance. This is an exact conditioned test (as it conditions on
both the row and column sums).

Case 2, where the row sums are fixed but the column sums are not, is more
common. For example a clinical trial with two groups of fixed size (e.g.
medication or placebo); the outcome of cure or no cure for each of the
patients is unknown in advance.

Case 3 is rare. For example flipping two coins N times and totalling the 
heads/tails for each independently.

I propose adding an unconditioned exact test. Case 2 is the more common and
simpler to support. It involves generating a test statistic for each possible
table given the fixed totals. The p-value is obtained by summing the
probabilities of the subset of possible tables whose test statistic is as or
more extreme than that of the observed table. Alternatively the subset is
built by incrementally adding candidates, choosing at each step the next sized
subset with the smallest p-value. This is the CSM (Convexity, Symmetry,
Minimization) test of Barnard (1945). This is computationally expensive and
benefits from precomputed tables which rank the order of tables for a given
size (m, n). In either case the computation of the p-value involves maximising
the p-value over a nuisance parameter in the range (0, 1).
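
The statistic-ordered variant (not CSM) can be sketched as follows. This is a
minimal illustration under several assumptions: a two-sided alternative, a
Z-pooled statistic, and a plain grid of 100 nuisance parameter values with no
refinement. The function names and the grid size are hypothetical choices, not
the API of any existing library.

```python
from math import comb, sqrt

def z_pooled(a, c, m, n):
    """Difference in proportions scaled by the pooled standard error."""
    p = (a + c) / (m + n)
    var = p * (1 - p) * (1 / m + 1 / n)
    # Zero variance only occurs when both proportions are equal (all 0 or all 1).
    return 0.0 if var == 0 else (a / m - c / n) / sqrt(var)

def unconditional_pvalue(a, c, m, n, statistic=z_pooled, points=100):
    """Two-sided unconditional p-value, maximised over a grid of pi values."""
    t_obs = abs(statistic(a, c, m, n))
    # All tables (i, j) with fixed row sums that are as or more extreme than
    # the observed table; the small tolerance guards against rounding in ties.
    extreme = [(i, j) for i in range(m + 1) for j in range(n + 1)
               if abs(statistic(i, j, m, n)) >= t_obs - 1e-12]
    best = 0.0
    for k in range(1, points):
        pi = k / points  # interior points of (0, 1)
        # Under the null hypothesis both rows share the success probability pi.
        p = sum(comb(m, i) * comb(n, j)
                * pi ** (i + j) * (1 - pi) ** (m + n - i - j)
                for i, j in extreme)
        best = max(best, p)
    return best

# Example: 7 of 15 successes in the first row, 12 of 15 in the second.
print(unconditional_pvalue(7, 12, 15, 15))
```

The set of extreme tables depends only on the statistic, so it is computed
once; only the sum of probabilities is re-evaluated for each value of the
nuisance parameter.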

Possible test statistics are Fisher's p-value for the table (known as
Boschloo's test (1970)), or a Z-pooled or Z-unpooled statistic.
Implementation of the CSM test is computationally intense.
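
For illustration, the three candidate statistics could be computed as below.
This is a sketch with hypothetical function names; the handling of zero
variance (returning 0 for equal proportions and an infinite value otherwise)
is an assumption made here, and only the one-sided "less" form of Fisher's
p-value is shown.

```python
from math import comb, copysign, inf, sqrt

def z_pooled(a, c, m, n):
    """Z statistic using the variance of the pooled proportion."""
    p0, p1 = a / m, c / n
    if p0 == p1:
        return 0.0
    p = (a + c) / (m + n)
    return (p0 - p1) / sqrt(p * (1 - p) * (1 / m + 1 / n))

def z_unpooled(a, c, m, n):
    """Z statistic using a separate variance estimate for each row."""
    p0, p1 = a / m, c / n
    if p0 == p1:
        return 0.0
    var = p0 * (1 - p0) / m + p1 * (1 - p1) / n
    # Both proportions at 0 or 1 but different: treat as infinitely extreme.
    return copysign(inf, p0 - p1) if var == 0 else (p0 - p1) / sqrt(var)

def fisher_p_less(a, c, m, n):
    """One-sided Fisher p-value P(X <= a) given all margins.

    Used as the statistic in Boschloo's test: smaller values are more extreme.
    """
    s, N = a + c, m + n
    lo = max(0, s - n)  # smallest value of a compatible with the margins
    return sum(comb(m, x) * comb(n, s - x) for x in range(lo, a + 1)) / comb(N, s)

# Example table: a=7 of m=15 and c=12 of n=15.
print(z_pooled(7, 12, 15, 15), z_unpooled(7, 12, 15, 15),
      fisher_p_less(7, 12, 15, 15))
```

Note the direction of the ordering differs: for the Z statistics larger
magnitudes are more extreme, whereas for Fisher's p-value smaller values are.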

There is a reference implementation in R as the Exact package:

[https://cran.r-project.org/web/packages/Exact/Exact.pdf]

SciPy has implementations of Boschloo's test and the Z-pooled/unpooled test
(which they name Barnard's test):

[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boschloo_exact.html]

[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.barnard_exact.html]

Note that the search over the nuisance parameter involves a univariate
function with multiple local optima. The implementations in R and SciPy both
use multiple start points to find candidate locations for a search for the
maximum. This is done by using a fixed number of uniform points in (0, 1) and
then (optionally) optimising the best candidate to find the maximum. The
function has no analytical derivative (gradients would require numerical
differentiation), so it is suited to a derivative-free method such as Brent
optimisation for the univariate case.
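
A sketch of the two-stage search is shown below. To stay within the Python
standard library it uses a golden-section search for the refinement step in
place of Brent's method; this is a substitution made for illustration only.
The function name, grid size and tolerance are assumptions.

```python
from math import sqrt

def maximise_on_unit_interval(f, points=50, tol=1e-8):
    """Coarse uniform grid over (0, 1), then refine around the best grid point."""
    h = 1.0 / points
    grid = [k * h for k in range(1, points)]
    best = max(grid, key=f)
    # Bracket one grid cell either side of the best candidate.
    lo, hi = max(best - h, 0.0), min(best + h, 1.0)
    inv_phi = (sqrt(5) - 1) / 2
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:
            # The maximum lies in [x1, hi]; reuse x2 as the new lower probe.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
        else:
            # The maximum lies in [lo, x2]; reuse x1 as the new upper probe.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
    x = (lo + hi) / 2
    fx, fbest = f(x), f(best)
    # Keep the grid point if the refinement did not improve on it.
    return (x, fx) if fx >= fbest else (best, fbest)

# Example: a function with two peaks; the higher one is at 0.8.
def two_peaks(x):
    return max(1 - 20 * abs(x - 0.2), 2 - 40 * abs(x - 0.8), 0.0)

x_max, f_max = maximise_on_unit_interval(two_peaks)
print(x_max, f_max)
```

The refinement assumes the function is unimodal within the bracket around the
best grid point, so the grid must be fine enough to separate the local maxima.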

See also:

[https://en.wikipedia.org/wiki/Boschloo%27s_test]

[https://en.wikipedia.org/wiki/Barnard%27s_test]

 

  was:
A 2x2 contingency table [[a, b], [c, d]] is used to visualize N independent 
observations of two binary variables (G or g and H or h):

 
{noformat}
    G g
  -------
H | a b | m
h | c d | n
  ----------
    s r | N{noformat}
The probability distributions are classified into 3 cases:

 
 # The row and column sums are fixed in advance. All table entries are 
determined by a. This follows a hypergeometric distribution with parameters N, 
m, s.
 # The row sums are fixed, but the column sums are not. All table entries are 
determined by a and c. The distribution is a join binomial distribution with 
probabilities p0 and p1:
a ~ B(m, p0); b ~ B(n, p1)
 # Only the total N is fixed (row and columns sums are not). The table (a, b, 
c, d) is a multinomial distribution.

Case 1 is covered by using Fisher's exact test (see [STATISTICS 64]). It does 
not occur in practice very often as the column and row sums are both fixed in 
advance. This is an exact conditioned test (as it conditions on the row sums).

Case 2 is more common where the row sums are fixed but the columns are not. For 
example a clinical trial with two groups of fixed size (e.g. medication or 
placebo); the outcome of cure or no cure for each of the patients is unknown.

Case 3 is rare. For example flipping two coins N times and totalling the 
heads/tails for each independently.

I propose adding a test that can handle an unconditioned exact test. Case 2 is 
the more common and simpler to support. It involves generating a test statistic 
for each possible table given the fixed totals. The p-value is obtained from a 
subset of the possible test statistics that are more extreme that the observed 
table. Alternatively the subset is maximised by incrementally adding candidates 
based on which next sized subset has the smallest p-value. This is the CSM 
(Convexity, Symmetry, Minimization) test of Barnard (1945). This is 
computational expensive and benefits from precomputed tables which ranks the 
order of tables for a given size (m,n). In either case the computation of the 
p-value involves maximising the p-value given a nuisance parameter in the range 
(0, 1).

Possible test statistics are Fisher's p-value for the table (known as 
Boschloo's test (1970)), or using a Z-pooled or Z-Unpooled statistic. 
Implementation of the CSM test is computationally intense.

There is a reference implementation in R as the Exact package:

[https://cran.r-project.org/web/packages/Exact/Exact.pdf]

SciPy has implementation of Boshloo's and the z-pooled/unpooled test (which 
they name Barnard's test):

[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boschloo_exact.html]

[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.barnard_exact.html]

Note that the search for the nuisance parameter involves a univariate function 
with multiple minima. The implementations in R and SciPy both use multiple 
start points to find candidate locations for a search for a maxima. This is 
done by using N uniform points in (0, 1) and then (optionally) optimising the 
best candidate to find the maximum. The function requires numerical 
differentiation and would be suitable for a non-derivative method such as Brent 
optimisation for the univariate case.

See also:

[https://en.wikipedia.org/wiki/Boschloo%27s_test]

[https://en.wikipedia.org/wiki/Barnard%27s_test]

 


> Add an unconditioned exact test for 2x2 contingency tables
> ----------------------------------------------------------
>
>                 Key: STATISTICS-69
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-69
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: inference
>            Reporter: Alex Herbert
>            Priority: Minor
>             Fix For: 1.1
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
