Thanks for your answer.

> The difference seems (thinking out loud) to stem from assumptions
> about the input. feature_selection.chi2 (implicitly) assumes a
> multinomial event model, so each X[i, j] is the frequency with which
> event j was observed when drawing X[i].sum() times from a multinomial.
> A zero input value is interpreted as the absence of an event, rather
> than a separate 0 event.
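To check that I follow, here is that multinomial reading spelled out in
code, a sketch based on my reading of the sklearn source (please correct
me if I have it wrong), using the same 18/7/42/33 counts as in the
session below. For a single boolean column, only the 1s count as events:

import numpy as np
from sklearn.feature_selection import chi2

# 100 samples: boolean feature in column 0, binary target in column 1.
A = np.vstack(([[0, 0]] * 18, [[0, 1]] * 7, [[1, 0]] * 42, [[1, 1]] * 33))
X = A[:, [0]].astype(float)
y = A[:, 1]

# Per-class feature sums are the "observed" event counts; the expected
# counts spread the total feature count according to class frequencies.
observed = np.array([X[y == c].sum(axis=0) for c in (0, 1)])
expected = np.outer([np.mean(y == 0), np.mean(y == 1)], X.sum(axis=0))
print(((observed - expected) ** 2 / expected).sum(axis=0))  # [ 0.5]
print(chi2(X, y)[0])                                        # [ 0.5]

The zeros in X never enter the observed counts, which is exactly the
"absence of an event" interpretation.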
If I understand you correctly, one way to reconcile the difference
between the two interpretations (multinomial vs. binomial) would be to
first binarize my boolean input variable:

>>> A = np.vstack(([[0,0]] * 18, [[0,1]] * 7, [[1,0]] * 42, [[1,1]] * 33))
>>> X = A[:, [0]]
>>> X = np.append(1 - X, X, axis=1)
>>> X.shape
(100, 2)
>>> y = A[:, 1]
>>> sklearn.feature_selection.chi2(X, y)
(array([ 1.5,  0.5]), array([ 0.22067136,  0.47950012]))

(Note the target is A[:, 1], the second column of the original data,
not a column of the binarized X.) Summing the chi-squared values of the
two features (1.5 + 0.5 = 2.0) then yields the same statistic as
scipy.stats.chi2_contingency, called with correction=False so that
Yates' continuity correction does not enter. Does that make sense?

If so, I have one last question: if feature_selection.chi2 always
assumes a multinomial event model, does that mean that anyone who uses
it the way I did (i.e. assuming a binomial event model) silently
obtains wrong results? Isn't there a use for the binomial case?

Thanks,
Christian
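P.S. For completeness, this is the comparison I had in mind, as a plain
script; the 2x2 table just tabulates the same 18/7/42/33 counts:

import numpy as np
from scipy.stats import chi2_contingency

# Rows = feature value (0/1), columns = target value (0/1).
table = np.array([[18, 7],
                  [42, 33]])

# correction=False disables Yates' continuity correction, which would
# otherwise make the statistic differ from the 1.5 + 0.5 = 2.0 sum.
stat, p, dof, expected = chi2_contingency(table, correction=False)
print(stat, p, dof)  # 2.0, ~0.157, 1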