OK, some more suggestions:
- do you observe the same behavior with SVC, which uses a different
  multiclass strategy (one-vs-one voting instead of the one-vs-all
  scheme of LogisticRegression)? see the first sketch below.
- what do you see when you inspect the results obtained with binary
  predictions, keeping 2 classes at a time? see the second sketch
  below.
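
for the SVC check, something along these lines should do. This is an
untested sketch: it assumes the X, y and labels arrays from your
snippets below (labels as a numpy array) and import paths from a
recent scikit-learn, so adjust as needed:

from numpy import histogram
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler
from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score

# same pipeline as yours, but SVC does one-vs-one voting for
# multiclass instead of LogisticRegression's one-vs-all scheme
pipe = Pipeline([("scale", Scaler()),
                 ("classify", SVC(kernel="linear"))])
cv = LeaveOneLabelOut(labels)
print cross_val_score(pipe, X, y, cv=cv).mean()
for train, test in cv:
    print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]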
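
and for the binary inspection, maybe something like this (again an
untested sketch under the same assumptions; combinations is only there
to enumerate the class pairs):

from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler
from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score

labels = np.asarray(labels)  # so that boolean masking works
for c1, c2 in combinations(np.unique(y), 2):
    # keep only the samples from this pair of classes and redo the
    # leave-one-run-out CV as a purely binary problem
    mask = (y == c1) | (y == c2)
    pipe = Pipeline([("scale", Scaler()),
                     ("classify", LogisticRegression())])
    cv = LeaveOneLabelOut(labels[mask])
    print c1, c2, cross_val_score(pipe, X[mask], y[mask], cv=cv).mean()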

Alex

On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom <[email protected]> wrote:
> Hi Alex,
>
> No, each subject has four runs, so I'm doing leave-one-run-out
> cross-validation in the original case. I'm estimating separate models
> within each subject (as is common in fMRI), so all my example code
> here would be from within a "for subject in subjects:" loop, but this
> pattern of weirdness is happening in every subject I've looked at so
> far.
>
> Michael
>
> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
> <[email protected]> wrote:
>> hi,
>>
>> Just a thought: you seem to be doing inter-subject prediction. In
>> that case a 5-fold split mixes subjects, so a hint is that you may
>> have a subject effect that acts as a confound.
>>
>> Again, just a thought; I read the email quickly.
>>
>> Alex
>>
>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
>>> Hi Yarick, thanks for chiming in! I thought about spamming the
>>> pymvpa list as well, but figured one at a time :)
>>>
>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>> multiclass setting, although I also tried it with the one-vs-one
>>> metaclassifier, with similarly "weird" results.
>>>
>>> Interestingly, though, I think the multiclass setting is a red
>>> herring. For this dataset we also have a two-class condition (you
>>> can think of the paradigm as a 3x2 design, although we're analyzing
>>> the two factors separately), and the same thing happens there:
>>>
>>> cv = LeaveOneLabelOut(labels)
>>> pipe = Pipeline([("scale", Scaler()),
>>>                  ("classify", LogisticRegression())])
>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>> for train, test in cv:
>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>
>>> 0.496377606851
>>> [ 0 68]
>>> [ 0 70]
>>> [ 0 67]
>>> [ 0 69]
>>>
>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>> pipe = Pipeline([("scale", Scaler()),
>>>                  ("classify", LogisticRegression())])
>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>> for train, test in cv:
>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>
>>> 0.532455733754
>>> [40 28]
>>> [36 34]
>>> [33 34]
>>> [31 38]
>>>
>>> Best,
>>> Michael
>>>
>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko
>>> <[email protected]> wrote:
>>>> Just to educate myself: how does sklearn make multiclass decisions
>>>> in this case? If it is all-pairs classification plus voting, then
>>>> the answer is simple: ties, and the "first one in order" would take
>>>> all of those.
>>>>
>>>> But if there are no ties involved, then theoretically (not sure
>>>> whether this applies to your data) it is easy to come up with
>>>> non-linear scenarios for binary classification where one class is
>>>> classified better than the other by a linear classifier. E.g., here
>>>> is an example (sorry, pymvpa) with an embedded normal (i.e. both
>>>> classes have their mean at the same spot but significantly
>>>> different variances):
>>>>
>>>> from mvpa2.suite import *
>>>> ns, nf = 100, 10
>>>> ds = dataset_wizard(
>>>>     np.vstack((
>>>>         np.random.normal(size=(ns, nf)),
>>>>         10 * np.random.normal(size=(ns, nf)))),
>>>>     targets=['narrow'] * ns + ['wide'] * ns,
>>>>     chunks=[0, 1] * ns)
>>>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>>                      enable_ca=['stats'])
>>>> cv(ds).samples
>>>> print cv.ca.stats
>>>>
>>>> yields
>>>>
>>>> ----------.
>>>> predictions\targets  narrow   wide
>>>>             `------  ------  ------    P'    N'   FP   FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>>>>              narrow     100      74   174    26   74    0  0.57     1     1  0.26  0.43  0.39  0.41
>>>>                wide       0      26    26   174    0   74     1  0.57  0.26     1     0  0.39  0.41
>>>> Per target:          ------  ------
>>>>                   P     100     100
>>>>                   N     100     100
>>>>                  TP     100      26
>>>>                  TN      26     100
>>>> Summary \ Means:     ------  ------   100   100   37   37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
>>>>               CHI^2  123.04  p=1.7e-26
>>>>                 ACC    0.63
>>>>                ACC%      63
>>>>           # of sets       2
>>>>
>>>> I bet that with a bit of creativity, similar classifier-dependent
>>>> cases could be found for linear underlying models.
>>>>
>>>> Cheers,
>>>>
>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> I hope you don't mind a question that's a mix of general machine
>>>>> learning and scikit-learn. I'm happy to kick it over to
>>>>> metaoptimize, but I'm not 100% sure I'm doing everything "right"
>>>>> from a scikit-learn perspective, so I thought it best to ask here
>>>>> first.
>>>>>
>>>>> I'm doing classification of fMRI data using logistic regression.
>>>>> I've been playing around with things for the past couple of days
>>>>> and was getting accuracies right around or only slightly above
>>>>> chance, which was disappointing. Initially, my code looked a bit
>>>>> like this:
>>>>>
>>>>> pipeline = Pipeline([("scale", Scaler()),
>>>>>                      ("classify", LogisticRegression())])
>>>>> cv = LeaveOneLabelOut(labels)
>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>> print acc
>>>>>
>>>>> 0.358599857854
>>>>>
>>>>> The labels are ints in [1, 4] specifying which fMRI run each
>>>>> sample came from, and y has three classes.
>>>>>
>>>>> When I went to inspect the predictions being made, though, I
>>>>> realized that in each split one class was almost completely
>>>>> dominating:
>>>>>
>>>>> cv = LeaveOneLabelOut(labels)
>>>>> for train, test in cv:
>>>>>     pipe = Pipeline([("scale", Scaler()),
>>>>>                      ("classify", LogisticRegression())])
>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>
>>>>> [58  0 11]
>>>>> [67  0  3]
>>>>> [ 0 70  0]
>>>>> [ 0 67  0]
>>>>>
>>>>> That doesn't seem right at all. I realized, though, that if I
>>>>> disregard the run labels and just do 5-fold cross-validation, the
>>>>> balance of predictions looks much more like what I would expect:
>>>>>
>>>>> cv = KFold(len(y), 5)
>>>>> for train, test in cv:
>>>>>     pipe = Pipeline([("scale", Scaler()),
>>>>>                      ("classify", LogisticRegression())])
>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>
>>>>> [22 16 17]
>>>>> [25 14 16]
>>>>> [17 25 13]
>>>>> [36  6 13]
>>>>> [37  9 10]
>>>>>
>>>>> (although note the still relative dominance of the first class).
>>>>> When I go back and run the full analysis this way, I get
>>>>> accuracies more in line with what I would have expected from
>>>>> previous fMRI studies in this domain.
>>>>>
>>>>> My design is slow event-related, so my samples should be
>>>>> independent, at least as far as HRF blurring is concerned.
>>>>>
>>>>> I'm not considering error trials, so the number of samples for
>>>>> each class is not perfectly balanced, but participants are near
>>>>> ceiling and thus the counts are very close:
>>>>>
>>>>> cv = LeaveOneLabelOut(labels)
>>>>> for train, test in cv:
>>>>>     print histogram(y[train], 3)[0]
>>>>>
>>>>> [71 67 69]
>>>>> [71 68 67]
>>>>> [70 69 67]
>>>>> [70 69 70]
>>>>>
>>>>> Apologies for the long explanation. Two questions, really:
>>>>>
>>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>>>
>>>>> 2) If not, can you help me build some intuition about why this is
>>>>> happening and what it means? Or suggest things I could look at in
>>>>> my data/code to identify the source of the problem?
>>>>>
>>>>> I really appreciate it! Aside from this befuddling issue, I've
>>>>> found scikit-learn an absolute delight to use!
>>>>>
>>>>> Best,
>>>>> Michael
>>>> --
>>>> =------------------------------------------------------------------=
>>>> Keep in touch                                     www.onerussian.com
>>>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
