2014-06-30 0:28 GMT+02:00 Christian Jauvin <cjau...@gmail.com>:
> What explains the difference in terms of the Chi-Square value (0.5 vs 2) and
> the P-value (0.48 vs 0.157)?

Here's the feature_selection.chi2 algorithm:

>>> import numpy as np
>>> from scipy.stats import chisquare
>>> from sklearn.preprocessing import LabelBinarizer
>>> A = np.vstack(([[0,0]] * 18, [[0,1]] * 7, [[1,0]] * 42, [[1,1]] * 33))
>>> X = A[:, [0]]    # feature matrix, shape (100, 1)
>>> y = A[:, 1]      # class labels
>>> Y = LabelBinarizer().fit_transform(y)
>>> if Y.shape[1] == 1:    # binary problem: add the complementary column
...     Y = np.append(1 - Y, Y, axis=1)
...
>>> observed = np.dot(Y.T, X)    # per-class feature counts, shape (2, 1)
>>> feature_count = np.atleast_2d(X.sum(axis=0))
>>> class_prob = np.atleast_2d(Y.mean(axis=0))
>>> expected = np.dot(class_prob.T, feature_count)
>>> chisquare(observed, expected)
(array([ 0.5]), array([ 0.47950012]))

Note that observed matches the second row of your cross-table:

>>> observed.ravel()
array([42, 33])
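
To spell out the arithmetic: class_prob is (0.6, 0.4) and feature_count
is 75, so expected is (45, 30), and the statistic is
(42 - 45)**2 / 45 + (33 - 30)**2 / 30 = 0.2 + 0.3 = 0.5, with one
degree of freedom, hence p ~ 0.48.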

The difference seems (thinking out loud) to stem from assumptions
about the input. feature_selection.chi2 implicitly assumes a
multinomial event model: each X[i, j] is the frequency with which
event j was observed when drawing X[i].sum() times from a multinomial.
A zero input value is interpreted as the absence of an event, not as a
count for a separate "0" event, so the x = 0 samples contribute
nothing to the observed or expected counts. The classical 2x2
contingency test, by contrast, includes the x = 0 row (counts 18 and
7) as well, which is where your 2 and p = 0.157 come from.
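
For comparison, here's a quick sketch of what I assume you computed:
Pearson's chi-squared test on the full 2x2 table, via scipy's
chi2_contingency with the Yates continuity correction turned off (I'm
guessing at the exact call you used):

>>> from scipy.stats import chi2_contingency
>>> table = np.array([[18, 7], [42, 33]])    # rows: x = 0, x = 1; cols: y = 0, y = 1
>>> stat, p, dof, exp = chi2_contingency(table, correction=False)
>>> round(stat, 4), round(p, 4)
(2.0, 0.1573)

Equivalently, if you one-hot encode the feature so that x = 0 becomes
an event of its own and rerun the steps above on the two resulting
columns, you should get statistics of 1.5 and 0.5, which sum to the
2.0 above (assuming I've done the arithmetic right).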
