2014-06-30 0:28 GMT+02:00 Christian Jauvin <cjau...@gmail.com>:
> What explains the difference in terms of the Chi-Square value (0.5 vs 2) and
> the P-value (0.48 vs 0.157)?

Here's the feature_selection.chi2 algorithm:

>>> import numpy as np
>>> from scipy.stats import chisquare
>>> from sklearn.preprocessing import LabelBinarizer
>>> A = np.vstack(([[0,0]] * 18, [[0,1]] * 7, [[1,0]] * 42, [[1,1]] * 33))
>>> X = A[:, [0]]    # feature matrix, shape (100, 1)
>>> y = A[:, 1]      # class labels
>>> Y = LabelBinarizer().fit_transform(y)
>>> if Y.shape[1] == 1:    # binary problem: add the complementary column
...     Y = np.append(1 - Y, Y, axis=1)
...
>>> observed = np.dot(Y.T, X)    # per-class feature counts, shape (2, 1)
>>> feature_count = np.atleast_2d(X.sum(axis=0))
>>> class_prob = np.atleast_2d(Y.mean(axis=0))
>>> expected = np.dot(class_prob.T, feature_count)
>>> chisquare(observed, expected)
(array([ 0.5]), array([ 0.47950012]))

Note that observed matches the second row of your cross-table:

>>> observed.ravel()
array([42, 33])
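
To spell out the arithmetic: class_prob is (0.6, 0.4) and feature_count
is 75, so expected is (45, 30), and the statistic is
(42 - 45)**2 / 45 + (33 - 30)**2 / 30 = 0.2 + 0.3 = 0.5, with one
degree of freedom, hence p ~ 0.48.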

The difference seems (thinking out loud) to stem from assumptions
about the input. feature_selection.chi2 implicitly assumes a
multinomial event model: each X[i, j] is the frequency with which
event j was observed when drawing X[i].sum() times from a multinomial.
A zero input value is interpreted as the absence of an event, not as a
count for a separate "0" event, so the x = 0 samples contribute
nothing to the observed or expected counts. The classical 2x2
contingency test, by contrast, includes the x = 0 row (counts 18 and
7) as well, which is where your 2 and p = 0.157 come from.
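
For comparison, here's a quick sketch of what I assume you computed:
Pearson's chi-squared test on the full 2x2 table, via scipy's
chi2_contingency with the Yates continuity correction turned off (I'm
guessing at the exact call you used):

>>> from scipy.stats import chi2_contingency
>>> table = np.array([[18, 7], [42, 33]])    # rows: x = 0, x = 1; cols: y = 0, y = 1
>>> stat, p, dof, exp = chi2_contingency(table, correction=False)
>>> round(stat, 4), round(p, 4)
(2.0, 0.1573)

Equivalently, if you one-hot encode the feature so that x = 0 becomes
an event of its own and rerun the steps above on the two resulting
columns, you should get statistics of 1.5 and 0.5, which sum to the
2.0 above (assuming I've done the arithmetic right).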
