Aha, this does indeed suggest something strange:
http://web.mit.edu/mwaskom/www/pca.png

I'm going to dig into this some more, but I don't really have any
strong intuitions to guide me here, so if anything pops out at you
from that, do feel free to speak up :)

Michael
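A quick sketch of how a projection like that can be made with
scikit-learn, in case anyone wants to reproduce it. This assumes the
X, y, and labels arrays from the snippets below; it is not the exact
code behind the plot:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import Scaler  # StandardScaler in later releases

# Project the scaled samples onto the first two principal components
# and color the points two ways: by class and by run. If the runs
# separate more cleanly than the classes, that points at a run-level
# confound.
X2 = PCA(n_components=2).fit_transform(Scaler().fit_transform(X))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X2[:, 0], X2[:, 1], c=y)
ax1.set_title("colored by class")
ax2.scatter(X2[:, 0], X2[:, 1], c=labels)
ax2.set_title("colored by run")
plt.show()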
On Sun, Jan 29, 2012 at 1:14 PM, Alexandre Gramfort
<[email protected]> wrote:
> hum...
>
> final suggestion: I would try to visualize a 2D or 3D PCA to see if it
> can give me some intuition on what's happening.
>
> Alex
>
> On Sun, Jan 29, 2012 at 9:58 PM, Michael Waskom <[email protected]> wrote:
>> Hi Alex,
>>
>> See my response to Yarick for some results from a binary
>> classification. I reran both the three-way and binary classification
>> with SVC, though, with similar results:
>>
>> cv = LeaveOneLabelOut(bin_labels)
>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>> print cross_val_score(pipe, bin_X, bin_y, cv=cv).mean()
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>     print histogram(pipe.fit(bin_X[train], bin_y[train]).predict(bin_X[test]), 2)[0]
>>
>> 0.496377606851
>> [ 0 68]
>> [ 0 70]
>> [ 0 67]
>> [ 0 69]
>>
>> cv = LeaveOneLabelOut(tri_labels)
>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>> print cross_val_score(pipe, tri_X, tri_y, cv=cv).mean()
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>     print histogram(pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test]), 3)[0]
>>
>> 0.386755821732
>> [20  0 48]
>> [29  1 40]
>> [ 2  0 65]
>> [ 0 69  0]
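As an aside for readers: the per-fold histograms above only show the
marginal distribution of predictions. scikit-learn's confusion_matrix
gives the per-class breakdown, which makes this kind of collapse
easier to see. A minimal sketch, again assuming the X, y, and labels
arrays from the thread:

from sklearn.cross_validation import LeaveOneLabelOut
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler
from sklearn.svm import SVC

# Rows are true classes, columns are predicted classes, so a fold where
# everything lands in one column means "everything predicted as that
# class" regardless of the truth.
cv = LeaveOneLabelOut(labels)
for train, test in cv:
    pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
    pred = pipe.fit(X[train], y[train]).predict(X[test])
    print confusion_matrix(y[test], pred)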
>>
>> On Sun, Jan 29, 2012 at 12:38 PM, Alexandre Gramfort
>> <[email protected]> wrote:
>>> ok
>>>
>>> some more suggestions:
>>>
>>> - do you observe the same behavior with SVC, which uses a different
>>>   multiclass strategy?
>>> - what do you see when you inspect results obtained with binary
>>>   predictions (keeping 2 classes at a time)?
>>>
>>> Alex
>>>
>>> On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom <[email protected]> wrote:
>>>> Hi Alex,
>>>>
>>>> No, each subject has four runs, so I'm doing leave-one-run-out cross
>>>> validation in the original case. I'm estimating separate models within
>>>> each subject (as is common in fMRI), so all my example code here would
>>>> be inside a "for subject in subjects:" loop, but this pattern of
>>>> weirdness is happening in every subject I've looked at so far.
>>>>
>>>> Michael
>>>>
>>>> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
>>>> <[email protected]> wrote:
>>>>> hi,
>>>>>
>>>>> just a thought. You seem to be doing inter-subject prediction. In that
>>>>> case a 5-fold split mixes subjects. My hunch is that you may have a
>>>>> subject effect that acts as a confound.
>>>>>
>>>>> again, just a thought -- I read the email quickly
>>>>>
>>>>> Alex
>>>>>
>>>>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
>>>>>> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
>>>>>> list, but figured one at a time :)
>>>>>>
>>>>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>>>>> multiclass setting, although I also tried it with their one-vs-one
>>>>>> metaclassifier with similar "weird" results.
>>>>>>
>>>>>> Interestingly, though, I think the multiclass setting is a red
>>>>>> herring. For this dataset we also have a two-class condition (you can
>>>>>> think of the paradigm as a 3x2 design, although we're analyzing the
>>>>>> factors separately), and the same thing happens there:
>>>>>>
>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>> for train, test in cv:
>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>
>>>>>> 0.496377606851
>>>>>> [ 0 68]
>>>>>> [ 0 70]
>>>>>> [ 0 67]
>>>>>> [ 0 69]
>>>>>>
>>>>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>> for train, test in cv:
>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>
>>>>>> 0.532455733754
>>>>>> [40 28]
>>>>>> [36 34]
>>>>>> [33 34]
>>>>>> [31 38]
>>>>>>
>>>>>> Best,
>>>>>> Michael
>>>>>>
>>>>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko
>>>>>> <[email protected]> wrote:
>>>>>>> just to educate myself -- how does sklearn make multiclass decisions
>>>>>>> in this case? if it is all-pairs classification + voting, then the
>>>>>>> answer is simple -- ties, and the "first one in order" would take
>>>>>>> all those.
>>>>>>>
>>>>>>> but if there are no ties involved, then theoretically (not sure if
>>>>>>> this applies to your data) it is easy to come up with non-linear
>>>>>>> scenarios for binary classification where one class would be better
>>>>>>> classified than the other by a linear classifier... e.g. here is an
>>>>>>> example (sorry -- pymvpa) with an embedded normal (i.e. both classes
>>>>>>> have their mean at the same spot but significantly different
>>>>>>> variances):
>>>>>>>
>>>>>>> from mvpa2.suite import *
>>>>>>> ns, nf = 100, 10
>>>>>>> ds = dataset_wizard(
>>>>>>>     np.vstack((
>>>>>>>         np.random.normal(size=(ns, nf)),
>>>>>>>         10 * np.random.normal(size=(ns, nf)))),
>>>>>>>     targets=['narrow'] * ns + ['wide'] * ns,
>>>>>>>     chunks=[0, 1] * ns)
>>>>>>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>>>>>                      enable_ca=['stats'])
>>>>>>> cv(ds).samples
>>>>>>> print cv.ca.stats
>>>>>>>
>>>>>>> yields
>>>>>>>
>>>>>>> predictions\targets  narrow   wide    P'   N'   FP  FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>>>>>>>             narrow      100     74   174   26   74   0  0.57     1     1  0.26  0.43  0.39  0.41
>>>>>>>               wide        0     26    26  174    0  74     1  0.57  0.26     1     0  0.39  0.41
>>>>>>> Per target:          ------ ------
>>>>>>>   P                     100    100
>>>>>>>   N                     100    100
>>>>>>>   TP                    100     26
>>>>>>>   TN                     26    100
>>>>>>> Summary \ Means:                     100  100   37  37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
>>>>>>> CHI^2  123.04  p=1.7e-26
>>>>>>> ACC    0.63
>>>>>>> ACC%   63
>>>>>>> # of sets  2
>>>>>>>
>>>>>>> I bet that with a bit of creativity, classifier-dependent cases of a
>>>>>>> similar flavor could be found for linear underlying models.
>>>>>>>
>>>>>>> Cheers,
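For anyone without PyMVPA handy, here is a rough scikit-learn
translation of Yaroslav's embedded-normal demonstration. This is my
own sketch, not his code, and the exact numbers will vary with the
random draw:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

# Two classes with identical means but very different variances: the
# tight "narrow" class falls almost entirely on one side of whatever
# hyperplane the SVM picks, while the "wide" class straddles it, so
# per-class accuracies come out wildly asymmetric even though overall
# accuracy beats chance.
ns, nf = 100, 10
X_demo = np.vstack((np.random.normal(size=(ns, nf)),
                    10 * np.random.normal(size=(ns, nf))))
y_demo = np.array([0] * ns + [1] * ns)  # 0 = narrow, 1 = wide
half = np.tile([True, False], ns)       # interleaved split, like chunks=[0, 1] * ns

clf = SVC(kernel="linear").fit(X_demo[half], y_demo[half])
print confusion_matrix(y_demo[~half], clf.predict(X_demo[~half]))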
>>>>>>>
>>>>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>>>>
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> I hope you don't mind a question that's a mix of general machine
>>>>>>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>>>>>>> but I'm not 100% sure I'm doing everything "right" from a
>>>>>>>> scikit-learn perspective, so I thought it best to ask here first.
>>>>>>>>
>>>>>>>> I'm doing classification of fMRI data using logistic regression.
>>>>>>>> I've been playing around with things for the past couple of days and
>>>>>>>> was getting accuracies right around or slightly above chance, which
>>>>>>>> was disappointing. Initially, my code looked a bit like this:
>>>>>>>>
>>>>>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>>>>> print acc
>>>>>>>>
>>>>>>>> 0.358599857854
>>>>>>>>
>>>>>>>> Labels are an int in [1, 4] specifying which fMRI run each sample
>>>>>>>> came from, and y has three classes.
>>>>>>>>
>>>>>>>> When I went to inspect the predictions being made, though, I
>>>>>>>> realized that in each split one class was almost completely
>>>>>>>> dominating:
>>>>>>>>
>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>> for train, test in cv:
>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>
>>>>>>>> [58  0 11]
>>>>>>>> [67  0  3]
>>>>>>>> [ 0 70  0]
>>>>>>>> [ 0 67  0]
>>>>>>>>
>>>>>>>> Which doesn't seem right at all. I realized that if I disregard the
>>>>>>>> labels and just run 5-fold cross-validation, though, the balance of
>>>>>>>> predictions looks much more like what I would expect:
>>>>>>>>
>>>>>>>> cv = KFold(len(y), 5)
>>>>>>>> for train, test in cv:
>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>
>>>>>>>> [22 16 17]
>>>>>>>> [25 14 16]
>>>>>>>> [17 25 13]
>>>>>>>> [36  6 13]
>>>>>>>> [37  9 10]
>>>>>>>>
>>>>>>>> (Although note the still relative dominance of the first class.)
>>>>>>>> When I go back and run the full analysis this way, I get accuracies
>>>>>>>> more in line with what I would have expected from previous fMRI
>>>>>>>> studies in this domain.
>>>>>>>>
>>>>>>>> My design is slow event-related, so my samples should be
>>>>>>>> independent, at least as far as HRF blurring is concerned.
>>>>>>>>
>>>>>>>> I'm not considering error trials, so the number of samples for each
>>>>>>>> class is not perfectly balanced, but participants are near ceiling
>>>>>>>> and thus the counts are very close:
>>>>>>>>
>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>> for train, test in cv:
>>>>>>>>     print histogram(y[train], 3)[0]
>>>>>>>>
>>>>>>>> [71 67 69]
>>>>>>>> [71 68 67]
>>>>>>>> [70 69 67]
>>>>>>>> [70 69 70]
>>>>>>>>
>>>>>>>> Apologies for the long explanation. Two questions, really:
>>>>>>>>
>>>>>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>>>>>>
>>>>>>>> 2) If not, can you help me build some intuition about why this is
>>>>>>>> happening and what it means? Or suggest things I could look at in my
>>>>>>>> data/code to identify the source of the problem?
>>>>>>>>
>>>>>>>> I really appreciate it! Aside from this befuddling issue, I've found
>>>>>>>> scikit-learn an absolute delight to use!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Michael
>>>>>>>
>>>>>>> --
>>>>>>> =------------------------------------------------------------------=
>>>>>>> Keep in touch                                     www.onerussian.com
>>>>>>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
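Looking back over the thread, one concrete way to test the
run-confound idea raised above: check whether run identity is itself
decodable from the data. This is my own sketch under the thread's
variable names (X, labels), not something the posters ran:

import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler

# If run identity is decodable well above chance (~25% for four runs),
# the features carry run-level structure that a leave-one-run-out
# split never sees at training time.
pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
print cross_val_score(pipe, X, labels, cv=KFold(len(labels), 5)).mean()

# Gross per-run statistics; large shifts in mean or scale across runs
# would point the same way.
for run in np.unique(labels):
    print run, X[labels == run].mean(), X[labels == run].std()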
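And if a run effect does show up, one common mitigation (my
suggestion, not something from the thread) is to standardize each run
separately before classification, so that run-specific offsets in mean
and scale are removed:

import numpy as np
from sklearn.preprocessing import Scaler

# Scale each run independently: every run becomes zero-mean and
# unit-variance on its own, which removes simple additive and
# multiplicative run effects (though not more complex ones).
X_runwise = np.empty_like(X)
for run in np.unique(labels):
    mask = labels == run
    X_runwise[mask] = Scaler().fit_transform(X[mask])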
