Hi,
Suppose I wanted to test the independence of two boolean variables using a
Chi-Square test:
>>> import numpy
>>> X = numpy.vstack(([[0,0]] * 18, [[0,1]] * 7, [[1,0]] * 42, [[1,1]] * 33))
>>> X.shape
(100, 2)
I'd like to understand the difference between doing:
>>> import sklearn.feature_selection
>>> sklearn.feature_selection.chi2(X[:,[0]], X[:,1])
(array([ 0.5]), array([ 0.47950012]))
and doing:
>>> import pandas
>>> pandas.crosstab(X[:,0], X[:,1])
col_0   0   1
row_0
0      18   7
1      42  33
>>> import scipy.stats
>>> scipy.stats.chi2_contingency(pandas.crosstab(X[:,0], X[:,1]), correction=False)
(2.0, 0.15729920705028505, 1, array([[ 15.,  10.],
       [ 45.,  30.]]))
What explains the difference between the Chi-Square values (0.5 vs. 2.0)
and the p-values (0.48 vs. 0.157)?
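For reference, I can reproduce sklearn's 0.5 by hand if I compare only the
per-class sums of the feature (i.e. the feature == 1 counts) against expected
counts derived from the class proportions. This is just my guess at what
feature_selection.chi2 does internally, based on the numbers:

```python
import numpy as np

# Rebuild the data: first column = feature, second column = class label
X = np.vstack(([[0, 0]] * 18, [[0, 1]] * 7, [[1, 0]] * 42, [[1, 1]] * 33))
feature, label = X[:, 0], X[:, 1]

# Observed: sum of the feature within each class (only the feature == 1 counts)
observed = np.array([feature[label == 0].sum(), feature[label == 1].sum()])  # [42, 33]

# Expected: total feature sum split according to the class proportions (0.6, 0.4)
class_prob = np.array([(label == 0).mean(), (label == 1).mean()])
expected = feature.sum() * class_prob  # [45., 30.]

# One-way Chi-Square over these two cells: 9/45 + 9/30 = 0.2 + 0.3 = 0.5
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # 0.5
```

So it looks like only half of the 2x2 table enters this computation, whereas
chi2_contingency also counts the feature == 0 cells, but I'd appreciate
confirmation.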
Thanks,
Christian
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general