Hi, just a thought: you seem to be doing inter-subject prediction. In that case, a 5-fold split mixes subjects across the training and test sets. A hint is that you may have a subject effect that acts as a confound.

Again, just a thought -- I read the email quickly.

Alex
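A quick way to check for that kind of confound -- a minimal sketch, not from the thread, written against the same 2012-era scikit-learn API used below, with X and labels standing in for Michael's feature matrix and run labels -- is to try to decode the labels themselves. Accuracy well above chance (1/4 for four runs) means the features carry a run/subject signature:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import StratifiedKFold, cross_val_score

# Decode the run label itself; accuracy well above 1/4 for four runs
# would indicate a run/subject effect in the features.
pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
cv = StratifiedKFold(labels, 5)
print cross_val_score(pipe, X, labels, cv=cv).mean()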
On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
> list, but figured one at a time :)
>
> The scikit-learn LogisticRegression class uses one-vs-all in a
> multiclass setting, although I also tried it with their one-vs-one
> metaclassifier, with similar "weird" results.
>
> Interestingly, though, I think the multiclass setting is a red
> herring. For this dataset we also have a two-class condition (you can
> think of the paradigm as a 3x2 design, although we're analyzing them
> separately), which has the same thing happening:
>
> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
> cv = LeaveOneLabelOut(labels)
> print cross_val_score(pipe, X, y, cv=cv).mean()
> for train, test in cv:
>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>
> 0.496377606851
> [ 0 68]
> [ 0 70]
> [ 0 67]
> [ 0 69]
>
> cv = LeaveOneLabelOut(np.random.permutation(labels))
> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
> print cross_val_score(pipe, X, y, cv=cv).mean()
> for train, test in cv:
>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>
> 0.532455733754
> [40 28]
> [36 34]
> [33 34]
> [31 38]
>
> Best,
> Michael
>
> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]> wrote:
>> Just to educate myself -- how does sklearn make multiclass decisions in
>> this case? If it is all-pairs classification + voting, then the answer is
>> simple -- ties, and the "first one in order" would take all of those.
>>
>> But if no ties are involved, then, theoretically (not sure whether it is
>> applicable to your data), it is easy to come up with non-linear scenarios
>> for binary classification where one class would be classified better than
>> the other by a linear classifier... e.g. here is an example (sorry --
>> pymvpa) with an embedded normal (i.e., both classes have their mean at
>> the same spot but significantly different variances):
>>
>> from mvpa2.suite import *
>> ns, nf = 100, 10
>> ds = dataset_wizard(
>>     np.vstack((
>>         np.random.normal(size=(ns, nf)),
>>         10*np.random.normal(size=(ns, nf)))),
>>     targets=['narrow']*ns + ['wide']*ns,
>>     chunks=[0,1]*ns)
>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>                      enable_ca=['stats'])
>> cv(ds).samples
>> print cv.ca.stats
>>
>> yields
>>
>> ----------.
>> predictions\targets  narrow   wide
>>             `------  ------  ------   P'   N'   FP   FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>>              narrow     100      74  174   26   74    0  0.57     1     1  0.26  0.43  0.39  0.41
>>                wide       0      26   26  174    0   74     1  0.57  0.26     1     0  0.39  0.41
>> Per target:          ------  ------
>>                   P     100     100
>>                   N     100     100
>>                  TP     100      26
>>                  TN      26     100
>> Summary \ Means:     ------  ------  100  100   37   37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
>>               CHI^2  123.04  p=1.7e-26
>>                 ACC    0.63
>>                ACC%      63
>>           # of sets       2
>>
>> I bet that with a bit of creativity, classifier-dependent examples of
>> similar behavior could be found for linear underlying models.
>>
>> Cheers,
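For readers following along in scikit-learn rather than PyMVPA, a rough translation of Yaroslav's embedded-normal demo might look like the sketch below. It is not from the thread; it uses the same 2012-era API as the rest of the code here, with SVC(kernel="linear") standing in for PyMVPA's LinearCSVMC and the chunk structure mirroring chunks=[0,1]*ns:

import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import LeaveOneLabelOut
from sklearn.metrics import confusion_matrix

# Two classes with a common mean but very different variances.
ns, nf = 100, 10
X = np.vstack((np.random.normal(size=(ns, nf)),        # "narrow"
               10 * np.random.normal(size=(ns, nf))))  # "wide"
y = np.array([0] * ns + [1] * ns)
chunks = np.tile([0, 1], ns)  # alternate samples between two chunks

clf = SVC(kernel="linear")
for train, test in LeaveOneLabelOut(chunks):
    pred = clf.fit(X[train], y[train]).predict(X[test])
    print confusion_matrix(y[test], pred)

As in his confusion table, one class tends to soak up most of the predictions even though the per-class sample counts are identical.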
>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>
>>> Hi Folks,
>>>
>>> I hope you don't mind a question that's a mix of general machine
>>> learning and scikit-learn. I'm happy to kick it over to MetaOptimize,
>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>> perspective, so I thought it best to ask here first.
>>>
>>> I'm doing classification of fMRI data using logistic regression. I've
>>> been playing around with things for the past couple of days and was
>>> getting accuracies right around or slightly above chance, which was
>>> disappointing. Initially, my code looked a bit like this:
>>>
>>> pipeline = Pipeline([("scale", Scaler()),
>>>                      ("classify", LogisticRegression())])
>>> cv = LeaveOneLabelOut(labels)
>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>> print acc
>>>
>>> 0.358599857854
>>>
>>> Labels are ints in [1, 4] specifying which fMRI run each sample came
>>> from, and y has three classes.
>>>
>>> When I went to inspect the predictions being made, though, I realized
>>> that in each split one class was almost completely dominating:
>>>
>>> cv = LeaveOneLabelOut(labels)
>>> for train, test in cv:
>>>     pipe = Pipeline([("scale", Scaler()),
>>>                      ("classify", LogisticRegression())])
>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>
>>> [58 0 11]
>>> [67 0 3]
>>> [ 0 70 0]
>>> [ 0 67 0]
>>>
>>> Which doesn't seem right at all. I realized that if I disregard the
>>> labels and just run 5-fold cross-validation, though, the balance of
>>> predictions looks much more like what I would expect:
>>>
>>> cv = KFold(len(y), 5)
>>> for train, test in cv:
>>>     pipe = Pipeline([("scale", Scaler()),
>>>                      ("classify", LogisticRegression())])
>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>
>>> [22 16 17]
>>> [25 14 16]
>>> [17 25 13]
>>> [36 6 13]
>>> [37 9 10]
>>>
>>> (Although note that the first class still dominates somewhat.) When
>>> I go back and run the full analysis this way, I get accuracies more in
>>> line with what I would have expected from previous fMRI studies in
>>> this domain.
>>>
>>> My design is slow event-related, so my samples should be independent,
>>> at least as far as HRF blurring is concerned.
>>>
>>> I'm not considering error trials, so the number of samples in each
>>> class is not perfectly balanced, but participants are near ceiling,
>>> so the counts are very close:
>>>
>>> cv = LeaveOneLabelOut(labels)
>>> for train, test in cv:
>>>     print histogram(y[train], 3)[0]
>>>
>>> [71 67 69]
>>> [71 68 67]
>>> [70 69 67]
>>> [70 69 70]
>>>
>>> Apologies for the long explanation. Two questions, really:
>>>
>>> 1) Does it look like I'm doing anything obviously wrong?
>>>
>>> 2) If not, can you help me build some intuition about why this is
>>> happening and what it means? Or suggest things I could look at in my
>>> data/code to identify the source of the problem?
>>>
>>> I really appreciate it! Aside from this befuddling issue, I've found
>>> scikit-learn an absolute delight to use!
>>>
>>> Best,
>>> Michael
>>
>> --
>> =------------------------------------------------------------------=
>> Keep in touch                                     www.onerussian.com
>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
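Alex's confound point can be made concrete on synthetic data: give each run an additive offset stronger than the class signal, and leave-one-run-out predictions collapse toward one class per fold, much like the histograms above. A minimal sketch, not from the thread; the sizes and scales below are invented for illustration:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import LeaveOneLabelOut

np.random.seed(0)
n_per_run, n_runs, nf = 60, 4, 10
y = np.tile([0, 1, 2], (n_per_run * n_runs) // 3)
labels = np.repeat(np.arange(n_runs), n_per_run)

# Weak class signal plus a strong per-run offset (the confound).
class_means = 0.5 * np.random.normal(size=(3, nf))
run_offsets = 3.0 * np.random.normal(size=(n_runs, nf))
X = class_means[y] + run_offsets[labels] + np.random.normal(size=(len(y), nf))

pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
for train, test in LeaveOneLabelOut(labels):
    pred = pipe.fit(X[train], y[train]).predict(X[test])
    print np.bincount(pred, minlength=3)  # typically dominated by one class

Centering each run's features separately before classification is one common way to take such an additive run effect out of play; the global Scaler fit on the mixed training runs does not remove it.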
