Thanks for your answer.

> The difference seems (thinking out loud) to stem from assumptions
> about the input. feature_selection.chi2 (implicitly) assumes a
> multinomial event model, so each X[i, j] is the frequency with which
> event j was observed when drawing X[i].sum() times from a multinomial.
> A zero input value is interpreted as the absence of an event, rather
> than a separate 0 event.

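For my own understanding, here is how I picture that computation for a
single column (my own rough reconstruction, not the actual
implementation): the observed value for class c is the sum of the column
over the samples of that class, so zero entries simply drop out of the
count, and the expected values split the column total by the class
priors.

>>> import numpy as np
>>> def chi2_one_column(x, y):
...     # observed "event count" per class: zero entries contribute nothing
...     observed = np.array([x[y == c].sum() for c in (0, 1)], dtype=float)
...     # expected counts: the column total split by the class priors
...     expected = x.sum() * np.array([(y == c).mean() for c in (0, 1)])
...     return ((observed - expected) ** 2 / expected).sum()

Applied to the (1 - x, x) columns of the session below, this reproduces
the 1.5 and 0.5 values.
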
If I understand you correctly, one way to reconcile the difference
between the two interpretations (multinomial vs. binomial) would be to
first binarize my boolean input variable:

>>> import numpy as np
>>> import sklearn.feature_selection
>>> # columns of A: the boolean feature x and the class label y
>>> A = np.vstack(([[0,0]] * 18, [[0,1]] * 7, [[1,0]] * 42, [[1,1]] * 33))
>>> X = A[:, [0]]
>>> X = np.append(1 - X, X, axis=1)  # one indicator column per value: (1 - x, x)
>>> X.shape
(100, 2)
>>> y = A[:, 1]
>>> sklearn.feature_selection.chi2(X, y)
(array([ 1.5,  0.5]), array([ 0.22067136,  0.47950012]))

Summing the chi-squared values of the two features (1.5 + 0.5 = 2.0)
then yields the same statistic as scipy.stats.chi2_contingency on the
2x2 contingency table. Does that make sense?

If so, I have one last question: if feature_selection.chi2 always
assumes a multinomial event model, does that mean that whoever uses it
the way I originally did (i.e. assuming a binomial event model) will
silently obtain wrong results? Isn't there a use for the binomial case?
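
To make the "silently" part concrete: passing the raw boolean column
from the session above instead of the binarized pair only tests the
x = 1 events, and nothing warns about the dropped zeros (the statistic
is the 0.5 of the second feature above, not 2.0):

>>> sklearn.feature_selection.chi2(A[:, [0]], y)
(array([ 0.5]), array([ 0.47950012]))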

Thanks,

Christian
