Would it hold if you PCA it down to two dimensions and visualize it? Do the
same effects hold?
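
A minimal sketch of that check (not from the original messages), assuming the
X, y, and labels arrays used in the snippets below plus matplotlib:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Project the features onto two principal components and color the points
    # by class and by run, to see which structure dominates.
    X2 = PCA(n_components=2).fit_transform(X)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(X2[:, 0], X2[:, 1], c=y)
    ax1.set_title("colored by class")
    ax2.scatter(X2[:, 0], X2[:, 1], c=labels)
    ax2.set_title("colored by run")
    plt.show()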

Michael Waskom <[email protected]> wrote:

>Hi Alex,
>
>See my response to Yarick for some results from a binary
>classification.  I reran both the three-way and binary classification
>with SVC, though, with similar results:
>
>cv = LeaveOneLabelOut(bin_labels)
>pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>print cross_val_score(pipe, bin_X, bin_y, cv=cv).mean()
>for train, test in cv:
>  pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>  print histogram(pipe.fit(bin_X[train], bin_y[train]).predict(bin_X[test]), 2)[0]
>
>0.496377606851
>[ 0 68]
>[ 0 70]
>[ 0 67]
>[ 0 69]
>
>cv = LeaveOneLabelOut(tri_labels)
>pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>print cross_val_score(pipe, tri_X, tri_y, cv=cv).mean()
>for train, test in cv:
>  pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>  print histogram(pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test]), 3)[0]
>
>0.386755821732
>[20  0 48]
>[29  1 40]
>[ 2  0 65]
>[ 0 69  0]
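
The histograms above only count predictions; a per-fold confusion matrix also
shows which true class they came from. A minimal sketch, assuming the same
tri_X, tri_y, and tri_labels arrays and the 0.10-era scikit-learn API quoted
above:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Scaler
    from sklearn.svm import SVC
    from sklearn.cross_validation import LeaveOneLabelOut
    from sklearn.metrics import confusion_matrix

    cv = LeaveOneLabelOut(tri_labels)
    for train, test in cv:
        pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
        pred = pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test])
        # rows are true classes, columns are predicted classes
        print confusion_matrix(tri_y[test], pred)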
>
>On Sun, Jan 29, 2012 at 12:38 PM, Alexandre Gramfort
><[email protected]> wrote:
>> ok
>>
>> some more suggestions:
>>
>> - do you observe the same behavior with SVC which uses a different
>> multiclass strategy?
>> - what do you see when you inspect results obtained with binary
>> predictions (keeping 2 classes at a time)?
>>
>> Alex
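
A rough sketch of the second suggestion (keeping two classes at a time), not
from the original messages; it assumes the tri_X, tri_y, and tri_labels arrays
from the reply above and the same 0.10-era scikit-learn API:

    from itertools import combinations
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Scaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score

    # Subset the three-class data to each pair of classes and rerun
    # leave-one-run-out cross-validation on that pair.
    for a, b in combinations(np.unique(tri_y), 2):
        mask = (tri_y == a) | (tri_y == b)
        cv = LeaveOneLabelOut(tri_labels[mask])
        pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
        print a, "vs", b, cross_val_score(pipe, tri_X[mask], tri_y[mask], cv=cv).mean()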
>>
>> On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom <[email protected]> wrote:
>>> Hi Alex,
>>>
>>> No, each subject has four runs, so I'm doing leave-one-run-out cross
>>> validation in the original case. I'm estimating separate models within
>>> each subject (as is common in fMRI), so all my example code here would
>>> be from within a for subject in subjects: loop, but this pattern of
>>> weirdness is happening in every subject I've looked at so far.
>>>
>>> Michael
>>>
>>> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
>>> <[email protected]> wrote:
>>>> hi,
>>>>
>>>> just a thought. You seem to be doing inter-subject prediction. In this
>>>> case a 5-fold CV mixes subjects. A hint is that you may have a subject
>>>> effect that acts as a confound.
>>>>
>>>> Again, just a thought; I read the email quickly.
>>>>
>>>> Alex
>>>>
>>>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
>>>>> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
>>>>> list, but figured one at a time :)
>>>>>
>>>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>>>> multiclass setting, although I also tried it with their one-vs-one
>>>>> metaclassifier with similar "weird" results.
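
For reference, a minimal sketch of that one-vs-one variant (not from the
original messages), assuming the same X, y, and labels arrays and that the
metaclassifier in question is sklearn.multiclass.OneVsOneClassifier:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Scaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score

    # Same pipeline as elsewhere in the thread, but with the multiclass
    # decision handled by one-vs-one voting instead of one-vs-all.
    cv = LeaveOneLabelOut(labels)
    pipe = Pipeline([("scale", Scaler()),
                     ("classify", OneVsOneClassifier(LogisticRegression()))])
    print cross_val_score(pipe, X, y, cv=cv).mean()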
>>>>>
>>>>> Interestingly, though, I think the multiclass setting is a red
>>>>> herring.  For this dataset we also have a two-class condition (you can
>>>>> think of the paradigm as a 3x2 design, although we're analyzing them
>>>>> separately), which has the same thing happening:
>>>>>
>>>>> cv = LeaveOneLabelOut(labels)
>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>> for train, test in cv:
>>>>>   pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>   print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>
>>>>> 0.496377606851
>>>>> [ 0 68]
>>>>> [ 0 70]
>>>>> [ 0 67]
>>>>> [ 0 69]
>>>>>
>>>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>> for train, test in cv:
>>>>>   print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>
>>>>> 0.532455733754
>>>>> [40 28]
>>>>> [36 34]
>>>>> [33 34]
>>>>> [31 38]
>>>>>
>>>>> Best,
>>>>> Michael
>>>>>
>>>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]> wrote:
>>>>>> just to educate myself -- how does sklearn make multiclass decisions in
>>>>>> this case?  if it is all-pairs classification + voting, then the answer
>>>>>> is simple -- ties, and the "first one in order" would take all those.
>>>>>>
>>>>>> but if there are no ties involved, then theoretically (since I'm not
>>>>>> sure whether it is applicable to your data) it is easy to come up with
>>>>>> non-linear scenarios for binary classification where one class would be
>>>>>> classified better than the other by a linear classifier...  e.g. here is
>>>>>> an example (sorry -- pymvpa) with an embedded normal (i.e. both classes
>>>>>> have their mean at the same spot but significantly different variances)
>>>>>>
>>>>>>    from mvpa2.suite import *
>>>>>>    ns, nf = 100, 10
>>>>>>    ds = dataset_wizard(
>>>>>>        np.vstack((
>>>>>>            np.random.normal(size=(ns, nf)),
>>>>>>            10*np.random.normal(size=(ns, nf)))),
>>>>>>        targets=['narrow']*ns + ['wide']*ns,
>>>>>>        chunks=[0,1]*ns)
>>>>>>    cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>>>>                         enable_ca=['stats'])
>>>>>>    cv(ds).samples
>>>>>>    print cv.ca.stats
>>>>>>
>>>>>> yields
>>>>>>
>>>>>>    ----------.
>>>>>>    predictions\targets  narrow   wide
>>>>>>                `------  ------  ------   P'   N'  FP  FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>>>>>>           narrow         100      74    174   26  74   0  0.57     1     1  0.26  0.43  0.39  0.41
>>>>>>            wide           0       26     26  174   0  74     1  0.57  0.26     1     0  0.39  0.41
>>>>>>    Per target:          ------  ------
>>>>>>             P            100     100
>>>>>>             N            100     100
>>>>>>             TP           100      26
>>>>>>             TN            26     100
>>>>>>    Summary \ Means:     ------  ------  100  100  37  37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
>>>>>>           CHI^2         123.04 p=1.7e-26
>>>>>>            ACC           0.63
>>>>>>            ACC%           63
>>>>>>         # of sets         2
>>>>>>
>>>>>>
>>>>>> I bet that with a bit of creativity, similar classifier-dependent cases
>>>>>> could be found for linear underlying models.
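
A rough scikit-learn analogue of the pymvpa demonstration above (not from the
original messages): two classes share a mean but differ strongly in variance,
and a linear SVM then tends to recover one class much better than the other.
The names and the LeaveOneLabelOut-over-chunks construction are assumptions:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.cross_validation import LeaveOneLabelOut
    from sklearn.metrics import confusion_matrix

    ns, nf = 100, 10
    X = np.vstack((np.random.normal(size=(ns, nf)),         # "narrow" class
                   10 * np.random.normal(size=(ns, nf))))   # "wide" class
    y = np.array([0] * ns + [1] * ns)
    chunks = np.array([0, 1] * ns)          # two alternating chunks, as above

    for train, test in LeaveOneLabelOut(chunks):
        clf = SVC(kernel="linear")
        pred = clf.fit(X[train], y[train]).predict(X[test])
        # rows are true classes, columns predictions; expect a strong asymmetry
        print confusion_matrix(y[test], pred)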
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>>>
>>>>>>> Hi Folks,
>>>>>>
>>>>>>> I hope you don't mind a question that's a mix of general machine
>>>>>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>>>>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>>>>>> perspective, so I thought it best to ask here first.
>>>>>>
>>>>>>> I'm doing classification of fMRI data using logistic regression.  I've
>>>>>>> been playing around with things for the past couple days and was
>>>>>>> getting accuracies right around or slightly above chance, which was
>>>>>>> disappointing.
>>>>>>> Initially, my code looked a bit like this:
>>>>>>
>>>>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>>>> print acc
>>>>>>
>>>>>>> 0.358599857854
>>>>>>
>>>>>>> Labels are an int in [1, 4] specifying which fMRI run each sample
>>>>>>> came from, and y has three classes.
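
For readers trying to run these snippets, a sketch of the namespace they
assume (not from the original messages): the 0.10-era scikit-learn imports,
numpy's histogram, X as an (n_samples, n_features) array, y as the class
vector, and labels as the run index per sample.

    import numpy as np
    from numpy import histogram
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Scaler   # later renamed StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.cross_validation import LeaveOneLabelOut, KFold, cross_val_score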
>>>>>>
>>>>>>> When I went to inspect the predictions being made, though, I realized
>>>>>>> in each split one class was almost completely dominating:
>>>>>>
>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>> for train, test in cv:
>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>
>>>>>>> [58  0 11]
>>>>>>> [67  0  3]
>>>>>>> [ 0 70  0]
>>>>>>> [ 0 67  0]
>>>>>>
>>>>>>> Which doesn't seem right at all.  I realized that if I disregard the
>>>>>>> labels and just run 5-fold cross-validation, though, the balance of
>>>>>>> predictions looks much more like what I would expect:
>>>>>>
>>>>>>> cv = KFold(len(y), 5)
>>>>>>> for train, test in cv:
>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>
>>>>>>> [22 16 17]
>>>>>>> [25 14 16]
>>>>>>> [17 25 13]
>>>>>>> [36  6 13]
>>>>>>> [37  9 10]
>>>>>>
>>>>>>> (Although note that the first class is still relatively dominant.)  When
>>>>>>> I go back and run the full analysis this way, I get accuracies more in
>>>>>>> line with what I would have expected from previous fMRI studies in
>>>>>>> this domain.
>>>>>>
>>>>>>> My design is slow event-related, so my samples should be independent,
>>>>>>> at least as far as HRF blurring is concerned.
>>>>>>
>>>>>>> I'm not considering error trials, so the number of samples for each
>>>>>>> class is not perfectly balanced, but participants are near ceiling and
>>>>>>> thus the counts are very close:
>>>>>>
>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>> for train, test in cv:
>>>>>>>     print histogram(y[train], 3)[0]
>>>>>>
>>>>>>> [71 67 69]
>>>>>>> [71 68 67]
>>>>>>> [70 69 67]
>>>>>>> [70 69 70]
>>>>>>
>>>>>>
>>>>>>> Apologie
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.