[
https://issues.apache.org/jira/browse/MADLIB-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172737#comment-15172737
]
Gautam Muralidhar commented on MADLIB-970:
------------------------------------------
[~fmcquillan] The results of the Chi-squared test may not be meaningful for
contingency tables (of any size other than 2x2) when more than 20% of the
cells in the table have expected counts < 5. The full conditions that need to
be satisfied are "no more than 20% of the cells have expected counts < 5, and
all cells have expected counts > 1".
For 2x2 tables, ideally all expected counts are >= 10. If some expected counts
are < 10 but >= 5, some authors recommend applying the Yates continuity
correction when computing the test statistic. However, this correction is not
universally adopted. So, for a 2x2 table, we can require all cells to have
expected counts >= 5. If the expected-counts condition does not hold, the
advice would be to use an exact test such as Fisher's exact test.
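The conditions above could be checked roughly as follows. This is only a
sketch in plain Python, not MADlib code: the function names
(expected_counts, chi2_assumption_warnings) and the warning wording are
hypothetical, and it assumes the table is a non-empty list of equal-length
rows of observed counts with a non-zero grand total.

```python
def expected_counts(table):
    """Expected cell counts under independence:
    row_total * column_total / grand_total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("contingency table has no observations")
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi2_assumption_warnings(table):
    """Return a list of warning strings; an empty list means the
    expected-count conditions described in this comment are met."""
    cells = [e for row in expected_counts(table) for e in row]
    warnings = []
    is_2x2 = len(table) == 2 and len(table[0]) == 2

    if is_2x2:
        # 2x2 tables: require every expected count to be >= 5.
        if min(cells) < 5:
            warnings.append(
                "2x2 table has an expected count < 5; "
                "consider Fisher's exact test")
    else:
        # Other tables: at most 20% of cells may have expected counts < 5 ...
        frac_small = sum(e < 5 for e in cells) / len(cells)
        if frac_small > 0.2:
            warnings.append(
                "more than 20% of cells have expected counts < 5")
        # ... and every cell must have an expected count > 1.
        if min(cells) <= 1:
            warnings.append("some cells have expected counts <= 1")
    return warnings
```

A caller could log each returned string as a warning, or refuse to run the
test when the list is non-empty; which of the two is preferable is the open
question in this issue.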
> Log a warning message when running Chi squared independence test if more
> than 20% of the cells in the contingency table have expected values < 5.
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MADLIB-970
> URL: https://issues.apache.org/jira/browse/MADLIB-970
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Inferential Statistics
> Reporter: Gautam Muralidhar
> Priority: Minor
>
> Log a warning message when running Chi squared independence test if more
> than 20% of the cells in the contingency table have expected values < 5. It
> might be acceptable to not proceed with the computation as well if more than
> 20% of the cells in the contingency table have expected values < 5.