2014-06-30 0:28 GMT+02:00 Christian Jauvin <cjau...@gmail.com>:
> What explains the difference in terms of the Chi-Square value (0.5 vs 2)
> and the P-value (0.48 vs 0.157)?
Here's the feature_selection.chi2 algorithm:

>>> import numpy as np
>>> from scipy.stats import chisquare
>>> from sklearn.preprocessing import LabelBinarizer
>>> A = np.vstack(([[0, 0]] * 18, [[0, 1]] * 7,
...                [[1, 0]] * 42, [[1, 1]] * 33))
>>> X = A[:, [0]]  # the feature, as a (100, 1) column
>>> y = A[:, 1]    # the class labels
>>> Y = LabelBinarizer().fit_transform(y)
>>> if Y.shape[1] == 1:
...     Y = np.append(1 - Y, Y, axis=1)
...
>>> observed = np.dot(Y.T, X)  # per-class counts of the feature
>>> feature_count = X.sum(axis=0).reshape(1, -1)
>>> class_prob = Y.mean(axis=0).reshape(1, -1)
>>> expected = np.dot(class_prob.T, feature_count)
>>> chisquare(observed, expected)
(array([ 0.5]), array([ 0.47950012]))

Note that observed matches the second row of your cross-table:

>>> observed.ravel()
array([42, 33])

The difference seems (thinking out loud) to stem from assumptions about
the input. feature_selection.chi2 (implicitly) assumes a multinomial
event model, so each X[i, j] is the frequency with which event j was
observed when drawing X[i].sum() times from a multinomial. A zero input
value is interpreted as the absence of an event, rather than as a
separate "0" event: samples where the feature is zero add nothing to
observed, so the first row of your cross-table ([18, 7]) never enters
the statistic.
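As a sanity check (assuming I've reconstructed the algorithm correctly),
calling the public function on the same X and y should reproduce this:

>>> from sklearn.feature_selection import chi2
>>> chi2(X, y)
(array([ 0.5]), array([ 0.47950012]))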
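For comparison, the classical Pearson test on the full 2x2 cross-table
(all four cells, zeros included) gives your other pair of numbers. A
quick sketch with scipy.stats.chi2_contingency, assuming your 2 / 0.157
figures come from the uncorrected test (correction=False turns off the
Yates continuity correction):

>>> from scipy.stats import chi2_contingency
>>> table = np.array([[18, 7], [42, 33]])
>>> stat, p, dof, exp = chi2_contingency(table, correction=False)
>>> round(stat, 4), round(p, 4)
(2.0, 0.1573)

The expected counts for the second row are [45, 30] in both
computations, so the 0.5 above is exactly that row's contribution to
the full 2.0 statistic; the missing 1.5 is the dropped first row.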