Alex Herbert created MATH-1627:
----------------------------------

             Summary: ChiSquareTest computes NaN with zero observations
                 Key: MATH-1627
                 URL: https://issues.apache.org/jira/browse/MATH-1627
             Project: Commons Math
          Issue Type: Bug
    Affects Versions: 4.0
            Reporter: Alex Herbert


Passing a table that contains zero observations to ChiSquareTest produces NaN:
{code:java}
ChiSquareTest chi2Test = new ChiSquareTest();
final long[][] counts = new long[2][2];
// chi2 is NaN because every count in the table is zero
double chi2 = chi2Test.chiSquare(counts);
{code}
This is due to a floating-point division by zero, which yields NaN rather than 
an exception. The bug was identified by SonarCloud analysis.
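
To show where the NaN comes from, here is a minimal, self-contained sketch of 
the Pearson chi-square statistic for a contingency table. It is not the Commons 
Math source; the class name ChiSquareNaNSketch is hypothetical and input 
validation is omitted on purpose.

```java
// Illustrative sketch only; not the Commons Math implementation.
final class ChiSquareNaNSketch {
    private ChiSquareNaNSketch() {}

    /** Pearson chi-square statistic for a rectangular table of counts. */
    static double chiSquare(long[][] counts) {
        final int nRows = counts.length;
        final int nCols = counts[0].length;
        final double[] rowSum = new double[nRows];
        final double[] colSum = new double[nCols];
        double total = 0;
        for (int r = 0; r < nRows; r++) {
            for (int c = 0; c < nCols; c++) {
                rowSum[r] += counts[r][c];
                colSum[c] += counts[r][c];
                total += counts[r][c];
            }
        }
        double sumSq = 0;
        for (int r = 0; r < nRows; r++) {
            for (int c = 0; c < nCols; c++) {
                // If total == 0 this is 0.0 / 0.0, which is NaN for doubles.
                final double expected = rowSum[r] * colSum[c] / total;
                final double diff = counts[r][c] - expected;
                // If expected == 0 this is also 0.0 / 0.0, so one empty row
                // or column is enough to make the whole sum NaN.
                sumSq += diff * diff / expected;
            }
        }
        return sumSq;
    }
}
```

With this sketch, both new long[2][2] and the table {{1, 0}, {0, 0}} return 
NaN, because NaN propagates through the running sum.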

The unit tests use R as a reference. In R this case raises an error stating 
that at least one entry must be positive. Setting one value to 1 allows R to 
run the chi-squared test, but the resulting statistic is not valid:
{code:r}
> m <- array(c(1,0,0,0), dim = c(2,2))
> chisq.test(m)

        Pearson's Chi-squared test

data:  m
X-squared = NaN, df = 1, p-value = NA

Warning message:
In chisq.test(m) : Chi-squared approximation may be incorrect
{code}
Other methods in ChiSquareTest raise a ZeroException if an entire array of 
observations is zero, or if both observations in a bin are zero.

The chi-square test has assumptions that do not hold when the number of 
observations is small. There is no fixed minimum for the number of observations 
per category. The document referenced in the code javadoc recommends an expected 
count of 5 per bin. To avoid setting limits on the sample size, a suggested fix 
is to raise a ZeroException if the sum of all counts is zero. This will avoid 
a NaN computation. Use of a suitable number of observations is left to the 
caller.
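
A minimal sketch of the suggested guard follows. The class and method names are 
hypothetical, and IllegalArgumentException stands in for the library's 
ZeroException so that the example has no dependencies. It assumes the counts 
have already been validated as non-negative.

```java
// Illustrative sketch only; names are hypothetical, not the library API.
final class ChiSquareGuardSketch {
    private ChiSquareGuardSketch() {}

    /**
     * Rejects a table whose counts sum to zero.
     * Assumes every count is non-negative, so a zero sum means all zeros.
     */
    static void checkNonZeroTotal(long[][] counts) {
        long total = 0;
        for (long[] row : counts) {
            for (long count : row) {
                total += count;
            }
        }
        if (total == 0) {
            // Stand-in for ZeroException in this dependency-free sketch.
            throw new IllegalArgumentException("sum of all counts is zero");
        }
    }
}
```

This check only rejects the all-zero table. A table such as {{1, 0}, {0, 0}} 
passes it, which matches the intent to leave the choice of a suitable sample 
size to the caller.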



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
