hmm... one final suggestion: I would try visualizing a 2D or 3D PCA projection of the data, to see whether it gives some intuition about what's happening.
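[Editor's note: that PCA suggestion might look like the following sketch. It is an illustration, not code from the thread: the synthetic X, y, and runs stand in for the real per-subject data, and the per-run baseline offset is an assumption meant to mimic a possible run confound.]

```python
import numpy as np

rng = np.random.RandomState(0)
n_per_run, n_features, n_runs = 60, 50, 4

# Synthetic stand-in for one subject's data: a weak class signal along a
# fixed direction w, plus an assumed run-specific baseline offset.
w = rng.normal(size=n_features)
X_parts, y, runs = [], [], []
for r in range(n_runs):
    offset = rng.normal(scale=2.0, size=n_features)  # run baseline (assumption)
    labels = rng.randint(0, 2, n_per_run)
    X_parts.append(rng.normal(size=(n_per_run, n_features))
                   + offset + 0.3 * np.outer(labels, w))
    y.extend(labels)
    runs.extend([r] * n_per_run)
X = np.vstack(X_parts)

# PCA via SVD of the centered data (sklearn.decomposition.PCA does the same).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T   # 2D scores: scatter-plot these, colored once by class
                       # (y) and once by run -- if points cluster by run
                       # rather than by class, a run confound is likely

print(proj.shape)
```

Plotting is omitted to keep the sketch dependency-light; the point is that coloring the same 2D scores by class versus by run makes a run effect visible at a glance.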
Alex

On Sun, Jan 29, 2012 at 9:58 PM, Michael Waskom <[email protected]> wrote:
> Hi Alex,
>
> See my response to Yarick for some results from a binary
> classification. I reran both the three-way and the binary
> classification with SVC, though, with similar results:
>
> cv = LeaveOneLabelOut(bin_labels)
> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
> print cross_val_score(pipe, bin_X, bin_y, cv=cv).mean()
> for train, test in cv:
>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>     print histogram(pipe.fit(bin_X[train], bin_y[train]).predict(bin_X[test]), 2)[0]
>
> 0.496377606851
> [ 0 68]
> [ 0 70]
> [ 0 67]
> [ 0 69]
>
> cv = LeaveOneLabelOut(tri_labels)
> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
> print cross_val_score(pipe, tri_X, tri_y, cv=cv).mean()
> for train, test in cv:
>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>     print histogram(pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test]), 3)[0]
>
> 0.386755821732
> [20  0 48]
> [29  1 40]
> [ 2  0 65]
> [ 0 69  0]
>
> On Sun, Jan 29, 2012 at 12:38 PM, Alexandre Gramfort
> <[email protected]> wrote:
>> ok
>>
>> some more suggestions:
>>
>> - do you observe the same behavior with SVC, which uses a different
>>   multiclass strategy?
>> - what do you see when you inspect results obtained with binary
>>   predictions (keeping 2 classes at a time)?
>>
>> Alex
>>
>> On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom <[email protected]> wrote:
>>> Hi Alex,
>>>
>>> No, each subject has four runs, so I'm doing leave-one-run-out cross-
>>> validation in the original case. I'm estimating separate models within
>>> each subject (as is common in fMRI), so all my example code here would
>>> be from within a "for subject in subjects:" loop, but this pattern of
>>> weirdness is happening in every subject I've looked at so far.
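[Editor's note: Scaler and LeaveOneLabelOut are 2012-era names; in current scikit-learn the equivalents are StandardScaler and LeaveOneGroupOut, with run membership passed as groups=. A sketch of the same score-plus-per-fold-histogram inspection, on synthetic placeholder data rather than the real features:]

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(240, 30))       # placeholder features (pure noise)
y = rng.randint(0, 2, 240)           # placeholder binary targets
runs = np.repeat(np.arange(4), 60)   # run membership: one fold per run

pipe = Pipeline([("scale", StandardScaler()),
                 ("classify", SVC(kernel="linear"))])
cv = LeaveOneGroupOut()

print(cross_val_score(pipe, X, y, cv=cv, groups=runs).mean())

# Per-fold prediction counts, as in the loop above; np.bincount replaces
# the old histogram(..., 2)[0] idiom.
for train, test in cv.split(X, y, groups=runs):
    pipe.fit(X[train], y[train])
    print(np.bincount(pipe.predict(X[test]), minlength=2))
```

On noise data the mean accuracy should hover near chance; the per-fold counts are what reveal whether a single class dominates each held-out run.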
>>>
>>> Michael
>>>
>>> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
>>> <[email protected]> wrote:
>>>> hi,
>>>>
>>>> just a thought: you seem to be doing inter-subject prediction. In that
>>>> case a 5-fold CV mixes subjects. The hint is that you may have a
>>>> subject effect that acts as a confound.
>>>>
>>>> again, just a thought -- I read the email quickly
>>>>
>>>> Alex
>>>>
>>>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]>
>>>> wrote:
>>>>> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
>>>>> list as well, but figured one at a time :)
>>>>>
>>>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>>>> multiclass setting, although I also tried it with the one-vs-one
>>>>> metaclassifier, with similarly "weird" results.
>>>>>
>>>>> Interestingly, though, I think the multiclass setting is a red
>>>>> herring. For this dataset we also have a two-class condition (you can
>>>>> think of the paradigm as a 3x2 design, although we're analyzing the
>>>>> factors separately), and the same thing happens there:
>>>>>
>>>>> cv = LeaveOneLabelOut(labels)
>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>> for train, test in cv:
>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>
>>>>> 0.496377606851
>>>>> [ 0 68]
>>>>> [ 0 70]
>>>>> [ 0 67]
>>>>> [ 0 69]
>>>>>
>>>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>> for train, test in cv:
>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>
>>>>> 0.532455733754
>>>>> [40 28]
>>>>> [36 34]
>>>>> [33 34]
>>>>> [31 38]
>>>>>
>>>>> Best,
>>>>> Michael
>>>>>
>>>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]>
>>>>> wrote:
>>>>>> just to educate myself -- how sklearn does
>>>>>> multiclass decisions in this case? if it is all-pairs classification
>>>>>> + voting, then the answer is simple -- ties, and the "first one in
>>>>>> order" would take all of them.
>>>>>>
>>>>>> but if there are no ties involved, then, theoretically (since I'm not
>>>>>> sure whether it applies to your data), it is easy to come up with
>>>>>> non-linear scenarios for binary classification where one class would
>>>>>> be better classified than the other by a linear classifier... e.g.
>>>>>> here is an example (sorry -- pymvpa) with an embedded normal (i.e.
>>>>>> both classes have their mean at the same spot but have significantly
>>>>>> different variances):
>>>>>>
>>>>>> from mvpa2.suite import *
>>>>>> ns, nf = 100, 10
>>>>>> ds = dataset_wizard(
>>>>>>     np.vstack((
>>>>>>         np.random.normal(size=(ns, nf)),
>>>>>>         10 * np.random.normal(size=(ns, nf)))),
>>>>>>     targets=['narrow'] * ns + ['wide'] * ns,
>>>>>>     chunks=[0, 1] * ns)
>>>>>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(), enable_ca=['stats'])
>>>>>> cv(ds).samples
>>>>>> print cv.ca.stats
>>>>>>
>>>>>> yields
>>>>>>
>>>>>> ----------.
>>>>>> predictions\targets  narrow  wide
>>>>>>             `------  ------ ------   P'   N'  FP  FN  PPV   NPV   TPR   SPC   FDR   MCC   AUC
>>>>>>              narrow     100     74  174   26  74   0  0.57  1     1     0.26  0.43  0.39  0.41
>>>>>>                wide       0     26   26  174   0  74  1     0.57  0.26  1     0     0.39  0.41
>>>>>> Per target:          ------ ------
>>>>>>          P              100    100
>>>>>>          N              100    100
>>>>>>          TP             100     26
>>>>>>          TN              26    100
>>>>>> Summary \ Means:     ------ ------  100  100  37  37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
>>>>>>        CHI^2         123.04 p=1.7e-26
>>>>>>        ACC             0.63
>>>>>>        ACC%           63
>>>>>>        # of sets       2
>>>>>>
>>>>>> I bet that with a bit of creativity, similar classifier-dependent
>>>>>> cases could be found for linear underlying models.
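[Editor's note: Yaroslav's embedded-normal demonstration ports to scikit-learn roughly as follows. This is my translation, not code from the thread; the exact numbers will differ from the pymvpa output above, but the per-class asymmetry should reproduce.]

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
ns, nf = 200, 10

def sample(n):
    # 'narrow' class ~ N(0, 1), 'wide' class ~ N(0, 10): identical means,
    # very different variances, so no linear boundary separates them well
    return np.vstack([rng.normal(size=(n, nf)),
                      10 * rng.normal(size=(n, nf))])

y = np.array([0] * ns + [1] * ns)        # 0 = narrow, 1 = wide
X_train, X_test = sample(ns), sample(ns)

clf = SVC(kernel="linear").fit(X_train, y)
pred = clf.predict(X_test)

acc = (pred == y).mean()
recall_narrow = (pred[y == 0] == 0).mean()
recall_wide = (pred[y == 1] == 1).mean()
# One class's recall should come out much higher than the other's,
# as in the TPR column of the confusion matrix above.
print(acc, recall_narrow, recall_wide)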
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>>>
>>>>>>> Hi Folks,
>>>>>>
>>>>>>> I hope you don't mind a question that's a mix of general machine
>>>>>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>>>>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>>>>>> perspective, so I thought it best to ask here first.
>>>>>>
>>>>>>> I'm doing classification of fMRI data using logistic regression. I've
>>>>>>> been playing around with things for the past couple of days and was
>>>>>>> getting accuracies right around or slightly above chance, which was
>>>>>>> disappointing. Initially, my code looked a bit like this:
>>>>>>
>>>>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>>>> print acc
>>>>>>
>>>>>>> 0.358599857854
>>>>>>
>>>>>>> Labels are ints in [1, 4] specifying which fMRI run each sample came
>>>>>>> from, and y has three classes.
>>>>>>
>>>>>>> When I went to inspect the predictions being made, though, I realized
>>>>>>> that in each split one class was almost completely dominating:
>>>>>>
>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>> for train, test in cv:
>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>
>>>>>>> [58  0 11]
>>>>>>> [67  0  3]
>>>>>>> [ 0 70  0]
>>>>>>> [ 0 67  0]
>>>>>>
>>>>>>> Which doesn't seem right at all.
>>>>>>> I realized that if I disregard the
>>>>>>> labels and just run 5-fold cross-validation, though, the balance of
>>>>>>> predictions looks much more like what I would expect:
>>>>>>
>>>>>>> cv = KFold(len(y), 5)
>>>>>>> for train, test in cv:
>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>
>>>>>>> [22 16 17]
>>>>>>> [25 14 16]
>>>>>>> [17 25 13]
>>>>>>> [36  6 13]
>>>>>>> [37  9 10]
>>>>>>
>>>>>>> (Although note the still-relative dominance of the first class.) When
>>>>>>> I go back and run the full analysis this way, I get accuracies more in
>>>>>>> line with what I would have expected from previous fMRI studies in
>>>>>>> this domain.
>>>>>>
>>>>>>> My design is slow event-related, so my samples should be independent,
>>>>>>> at least as far as HRF blurring is concerned.
>>>>>>
>>>>>>> I'm not considering error trials, so the number of samples for each
>>>>>>> class is not perfectly balanced, but participants are near ceiling and
>>>>>>> thus the counts are very close:
>>>>>>
>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>> for train, test in cv:
>>>>>>>     print histogram(y[train], 3)[0]
>>>>>>
>>>>>>> [71 67 69]
>>>>>>> [71 68 67]
>>>>>>> [70 69 67]
>>>>>>> [70 69 70]
>>>>>>
>>>>>>> Apologies for the long explanation. Two questions, really:
>>>>>>
>>>>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>>>>
>>>>>>> 2) If not, can you help me build some intuition about why this is
>>>>>>> happening and what it means? Or suggest things I could look at in my
>>>>>>> data/code to identify the source of the problem?
>>>>>>
>>>>>>> I really appreciate it! Aside from this befuddling issue, I've found
>>>>>>> scikit-learn an absolute delight to use!
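[Editor's note: one way to build intuition for the run-confound hypothesis raised later in the thread -- lopsided predictions under leave-one-run-out but balanced ones under shuffled K-fold -- is the sketch below. It is illustrative only: synthetic data with an assumed additive per-run offset, and a simple nearest-class-mean classifier standing in for LogisticRegression.]

```python
import numpy as np

rng = np.random.RandomState(0)
n_runs, n_per_run, n_feat = 4, 60, 20

# Hypothetical generative story: a weak class signal along direction w,
# plus a run-specific additive offset that dwarfs the signal.
w = rng.normal(size=n_feat)
X_parts, y, runs = [], [], []
for r in range(n_runs):
    offset = rng.normal(scale=3.0, size=n_feat)       # the run confound
    labels = np.repeat([0, 1], n_per_run // 2)
    X_parts.append(rng.normal(size=(n_per_run, n_feat))
                   + offset + 0.3 * np.outer(labels, w))
    y.extend(labels)
    runs.extend([r] * n_per_run)
X, y, runs = np.vstack(X_parts), np.array(y), np.array(runs)

def nearest_mean_predict(Xtr, ytr, Xte):
    # Predict the class whose training-set mean is closer (stand-in classifier).
    m0, m1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return (((Xte - m1) ** 2).sum(1) < ((Xte - m0) ** 2).sum(1)).astype(int)

# Leave-one-run-out: the held-out run's offset shifts every test sample
# toward one side of the decision boundary at once, so per-run prediction
# counts tend to go lopsided -- unlike a shuffled KFold, whose folds all
# share the same mixture of run offsets.
counts = []
for r in range(n_runs):
    pred = nearest_mean_predict(X[runs != r], y[runs != r], X[runs == r])
    counts.append(np.bincount(pred, minlength=2))
    print("run", r, counts[-1])
```

If this mechanism is what is happening in the real data, removing or modeling the per-run baseline (e.g. centering each run's features before classification) should restore balanced leave-one-run-out predictions.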
>>>>>>
>>>>>>> Best,
>>>>>>> Michael
>>>>>> --
>>>>>> =------------------------------------------------------------------=
>>>>>> Keep in touch                                     www.onerussian.com
>>>>>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Try before you buy = See our experts in action!
>>>>>> The most comprehensive online learning library for Microsoft developers
>>>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>>>> _______________________________________________
>>>>>> Scikit-learn-general mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
