Hi Alex,

No, each subject has four runs, so I'm doing leave-one-run-out cross-validation in the original case. I'm estimating separate models within each subject (as is common in fMRI), so all my example code here would be inside a "for subject in subjects:" loop, but this pattern of weirdness is happening in every subject I've looked at so far.
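For reference, here is a minimal sketch of the per-subject, leave-one-run-out setup described above, written against the current scikit-learn API (LeaveOneGroupOut and StandardScaler are the present-day names for the LeaveOneLabelOut and Scaler classes used elsewhere in this thread). The make_subject helper, the subject IDs, and the pure-noise data are hypothetical stand-ins, not the real fMRI data, so the printed numbers say nothing about the actual problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_subject(seed, n_runs=4, n_per_run=60, n_features=20, n_classes=3):
    """Hypothetical stand-in for one subject's data: pure noise, no signal."""
    rng = np.random.default_rng(seed)
    n = n_runs * n_per_run
    X = rng.normal(size=(n, n_features))
    y = rng.integers(0, n_classes, size=n)
    runs = np.repeat(np.arange(1, n_runs + 1), n_per_run)  # run label per sample
    return X, y, runs


def run_subject(X, y, runs, n_classes=3):
    """Leave-one-run-out CV for one subject.

    Returns the mean accuracy and, for each held-out run, the number of
    test samples assigned to each class.
    """
    cv = LeaveOneGroupOut()
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    acc = cross_val_score(pipe, X, y, groups=runs, cv=cv).mean()
    counts = []
    for train, test in cv.split(X, y, groups=runs):
        pred = pipe.fit(X[train], y[train]).predict(X[test])
        counts.append(np.bincount(pred, minlength=n_classes))
    return acc, np.array(counts)


subjects = {"sub01": 0, "sub02": 1}  # hypothetical subject IDs -> random seeds
results = {}
for subject, seed in subjects.items():
    X, y, runs = make_subject(seed)
    results[subject] = run_subject(X, y, runs)
    acc, counts = results[subject]
    print(subject, round(acc, 3))
    print(counts)  # one row per held-out run, one column per predicted class
```

Each row of counts plays the role of the histogram(...)[0] lines in the quoted messages: a row dominated by a single column is the symptom being discussed.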
Michael

On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort <[email protected]> wrote:
> hi,
>
> just a thought. You seem to be doing inter-subject prediction. In this case
> a 5 fold mixes subjects. A hint is that you may have a subject effect that
> acts as a confound.
>
> again just a thought reading the email quickly
>
> Alex
>
> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
>> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
>> list, but figured one at a time :)
>>
>> The scikit-learn LogisticRegression class uses one-vs-all in a
>> multiclass setting, although I also tried it with their one-vs-one
>> metaclassifier with similar "weird" results.
>>
>> Interestingly, though, I think the multiclass setting is a red
>> herring. For this dataset we also have a two-class condition (you can
>> think of the paradigm as a 3x2 design, although we're analyzing them
>> separately), which has the same thing happening:
>>
>> cv = LeaveOneLabelOut(labels)
>> print cross_val_score(pipe, X, y, cv=cv).mean()
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>
>> 0.496377606851
>> [ 0 68]
>> [ 0 70]
>> [ 0 67]
>> [ 0 69]
>>
>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>> print cross_val_score(pipe, X, y, cv=cv).mean()
>> for train, test in cv:
>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>
>> 0.532455733754
>> [40 28]
>> [36 34]
>> [33 34]
>> [31 38]
>>
>> Best,
>> Michael
>>
>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]>
>> wrote:
>>> just to educate myself -- how sklearn does multiclass decisions in this
>>> case? if it is all pairs classification + voting, then the answer is
>>> simple -- ties, and the "first one in order" would take all those.
>>>
>>> but if there is no ties involved then, theoretically (since not sure if it is
>>> applicable to your data) it is easy to come up with non-linear scenarios for
>>> binary classification where 1 class would be better classified than the other
>>> one with a linear classifier... e.g. here is an example (sorry -- pymvpa) with
>>> an embedded normal (i.e. both classes mean at the same spot but have
>>> significantly different variances)
>>>
>>> from mvpa2.suite import *
>>> ns, nf = 100, 10
>>> ds = dataset_wizard(
>>>     np.vstack((
>>>         np.random.normal(size=(ns, nf)),
>>>         10*np.random.normal(size=(ns, nf)))),
>>>     targets=['narrow']*ns + ['wide']*ns,
>>>     chunks=[0,1]*ns)
>>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>                      enable_ca=['stats'])
>>> cv(ds).samples
>>> print cv.ca.stats
>>>
>>> yields
>>>
>>> ----------.
>>> predictions\targets  narrow    wide
>>>             `------  ------  ------   P'   N'   FP   FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>>>        narrow           100      74  174   26   74    0  0.57     1     1  0.26  0.43  0.39  0.41
>>>         wide              0      26   26  174    0   74     1  0.57  0.26     1     0  0.39  0.41
>>> Per target:          ------  ------
>>>          P              100     100
>>>          N              100     100
>>>          TP             100      26
>>>          TN              26     100
>>> Summary \ Means:     ------  ------  100  100   37   37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
>>>        CHI^2         123.04  p=1.7e-26
>>>         ACC            0.63
>>>         ACC%             63
>>>      # of sets            2
>>>
>>>
>>> I bet with a bit of creativity, classifier-dependent cases of similar
>>> cases could be found for linear underlying models.
>>>
>>> Cheers,
>>>
>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>
>>>> Hi Folks,
>>>
>>>> I hope you don't mind a question that's a mix of general machine
>>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>>> perspective so I thought it best to ask here first.
>>>
>>>> I'm doing classification of fMRI data using logistic regression.
>>>> I've been playing around with things for the past couple days and was
>>>> getting accuracies right around or slightly above chance, which was
>>>> disappointing.
>>>> Initially, my code looked a bit like this:
>>>
>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>> cv = LeaveOneLabelOut(labels)
>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>> print acc
>>>
>>>> 0.358599857854
>>>
>>>> Labels are an int in [1, 4] specifying which fmri run each sample came
>>>> from, and y has three classes.
>>>
>>>> When I went to inspect the predictions being made, though, I realized
>>>> in each split one class was almost completely dominating:
>>>
>>>> cv = LeaveOneLabelOut(labels)
>>>> for train, test in cv:
>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>
>>>> [58  0 11]
>>>> [67  0  3]
>>>> [ 0 70  0]
>>>> [ 0 67  0]
>>>
>>>> Which doesn't seem right at all. I realized that if I disregard the
>>>> labels and just run 5-fold cross validation, though, the balance of
>>>> predictions looks much more like what I would expect:
>>>
>>>> cv = KFold(len(y), 5)
>>>> for train, test in cv:
>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>
>>>> [22 16 17]
>>>> [25 14 16]
>>>> [17 25 13]
>>>> [36  6 13]
>>>> [37  9 10]
>>>
>>>> (Although note the still relative dominance of the first class). When
>>>> I go back and run the full analysis this way, I get accuracies more in
>>>> line with what I would have expected from previous fMRI studies in
>>>> this domain.
>>>
>>>> My design is slow event-related, so my samples should be independent
>>>> at least as far as HRF-blurring is considered.
>>>
>>>> I'm not considering error trials so the number of samples for each
>>>> class is not perfectly balanced, but participants are near ceiling and
>>>> thus they are very close:
>>>
>>>> cv = LeaveOneLabelOut(labels)
>>>> for train, test in cv:
>>>>     print histogram(y[train], 3)[0]
>>>
>>>> [71 67 69]
>>>> [71 68 67]
>>>> [70 69 67]
>>>> [70 69 70]
>>>
>>>
>>>> Apologies for the long explanation. Two questions, really:
>>>
>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>
>>>> 2) If not, can you help me build some intuition about why this is
>>>> happening and what it means? Or suggest things I could look at in my
>>>> data/code to identify the source of the problem?
>>>
>>>> I really appreciate it! Aside from this befuddling issue, I've found
>>>> scikit-learn an absolute delight to use!
>>>
>>>> Best,
>>>> Michael
>>> --
>>> =------------------------------------------------------------------=
>>> Keep in touch                                     www.onerussian.com
>>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
>>>
>>> ------------------------------------------------------------------------------
>>> Try before you buy = See our experts in action!
>>> The most comprehensive online learning library for Microsoft developers
>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>> http://p.sf.net/sfu/learndevnow-dev2
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
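
The embedded-normal example quoted earlier in this thread uses PyMVPA. A rough scikit-learn analogue might look like the sketch below. It is an approximation under stated assumptions, not a port: LinearSVC stands in for LinearCSVMC, LeaveOneGroupOut over the two chunks stands in for NFoldPartitioner, and the random seed is arbitrary, so the numbers will not match the table quoted in the thread.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability
ns, nf = 100, 10

# Two classes centred on the same point; "wide" has 10x the standard deviation.
X = np.vstack([rng.normal(size=(ns, nf)), 10 * rng.normal(size=(ns, nf))])
y = np.array(["narrow"] * ns + ["wide"] * ns)
chunks = np.tile([0, 1], ns)  # alternate samples between two chunks

# Assumption: LinearSVC as a stand-in for PyMVPA's LinearCSVMC.
clf = LinearSVC(max_iter=10000)
pred = cross_val_predict(clf, X, y, groups=chunks, cv=LeaveOneGroupOut())

# Rows are true classes and columns are predictions here, which is the
# transpose of the PyMVPA layout (predictions in rows, targets in columns).
cm = confusion_matrix(y, pred, labels=["narrow", "wide"])
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(dict(zip(["narrow", "wide"], recall)))
```

Comparing the two per-class recall values is the quickest way to see whether a linear boundary favours one class even though the overall class counts are balanced.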
