Hi,
Suppose I wanted to test the independence of two boolean variables using a
Chi-Square test:
>>> import numpy
>>> X = numpy.vstack(([[0,0]] * 18, [[0,1]] * 7, [[1,0]] * 42, [[1,1]] * 33))
>>> X.shape
(100, 2)
I'd like to understand the difference between doing:
>>> import sklearn.feature_selection
>>> sklearn.feature_selection.chi2(X[:,[0]], X[:,1])
(array([ 0.5]), array([ 0.47950012]))
and doing:
>>> import pandas
>>> pandas.crosstab(X[:,0], X[:,1])
col_0   0   1
row_0
0      18   7
1      42  33
>>> import scipy.stats
>>> scipy.stats.chi2_contingency(pandas.crosstab(X[:,0], X[:,1]), correction=False)
(2.0, 0.15729920705028505, 1, array([[ 15.,  10.],
       [ 45.,  30.]]))
What explains the difference between the Chi-Square values (0.5 vs. 2.0)
and the p-values (0.48 vs. 0.157)?
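For reference, I can reproduce sklearn's 0.5 by hand if I compare only the
per-class sums of the feature (i.e. the feature == 1 counts) against expected
counts derived from the class proportions. This is just my guess at what
feature_selection.chi2 does internally, based on the numbers:

```python
import numpy as np

# Rebuild the data: first column = feature, second column = class label
X = np.vstack(([[0, 0]] * 18, [[0, 1]] * 7, [[1, 0]] * 42, [[1, 1]] * 33))
feature, label = X[:, 0], X[:, 1]

# Observed: sum of the feature within each class (only the feature == 1 counts)
observed = np.array([feature[label == 0].sum(), feature[label == 1].sum()])  # [42, 33]

# Expected: total feature sum split according to the class proportions (0.6, 0.4)
class_prob = np.array([(label == 0).mean(), (label == 1).mean()])
expected = feature.sum() * class_prob  # [45., 30.]

# One-way Chi-Square over these two cells: 9/45 + 9/30 = 0.2 + 0.3 = 0.5
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # 0.5
```

So it looks like only half of the 2x2 table enters this computation, whereas
chi2_contingency also counts the feature == 0 cells, but I'd appreciate
confirmation.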
Thanks,
Christian
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general