On 2012-03-21, at 7:25 PM, Gael Varoquaux <[email protected]> wrote:
> I'd like to stress that I don't think that following libsvm is much of
> a goal per se. I understand that it makes the life of someone like
> James easier, because he knows libsvm well and can relate to it.

I think it's less about disagreeing with libsvm than disagreeing with
the notation of every textbook presentation I know of. I agree that
libsvm is no golden calf.

> libsvm and liblinear do not agree on whether multi-class should be
> done with one versus rest or one versus one.

<side rant> In particular, doing 1-vs-rest for logistic regression
seems like an odd choice when there is a perfectly good multiclass
generalization of logistic regression. Mathieu clarified to me last
night how liblinear is calculating "probabilities" in the multiclass
case, and it seems insane to me from a calibration perspective:
normalizing a bunch of things by their sum does not make them
probabilities in any meaningful sense! (See the first sketch at the
end of this message.) </side rant>

> Actually, if we are going to debate about the exact value that the
> parameter should take, let me tell you my point of view from an
> abstract, user-centric aspect: it is meaningless that when I use
> logistic regression, bigger C means less regularization, whereas when
> I use lasso, bigger alpha means more regularization. As someone who
> has spent a little while doing statistical learning, I understand the
> reasons behind this, but it is really a nuisance for non-experts.

Agreed. It *still is* a nuisance for this quasi-expert. ;) (The second
sketch below shows the inconsistency side by side.)

> I believe that the right choice is to have the ratio between the loss
> and the penalization invariant in the number of samples. From a
> theoretical perspective, I believe that this is the case because the
> loss is a plug-in estimate of a risk. Such an estimate should not
> grow with the number of samples. From a practical point of view, I
> believe that this is the right choice because if I learn to set C on
> a dataset, and you give me a new dataset saying it comes from the
> same source/feed, I should be able to use the same C. In practice,
> the reason why Alex found this problem was that on real-life data he
> had difficulties setting C.
>
> That said, I agree with James that the docs should be much more
> explicit about what is going on, and how what we have differs from
> libsvm.

I think that renaming sklearn's scaled version of "C" is probably a
start. Using the name "C" for something other than what everyone else
means by "C" violates the principle of least surprise quite severely.
If users saw "zeta" or "Francis" or "unicorn", most of them would not
assume it was a moniker for C, but would refer to the documentation
for an explanation.

David
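
P.S. To make a few of the points above concrete, here are three toy
sketches. They are my own illustrations of the arguments, not
scikit-learn or liblinear internals.

First, the calibration rant: take per-class one-vs-rest decision
values, squash each through its own sigmoid, and divide by the sum.
The result sums to 1 by construction, but it is not the posterior of
any coherent joint model; a true multinomial (softmax) posterior over
the same scores generally disagrees with it. (I don't claim this is
literally what liblinear computes, just the flavor of "normalize and
call it a probability".)

    import numpy as np

    scores = np.array([2.0, 0.5, -1.0])              # per-class decision values
    ovr = 1.0 / (1.0 + np.exp(-scores))              # independent 1-vs-rest sigmoids
    ovr_normalized = ovr / ovr.sum()                 # "probabilities" by fiat
    softmax = np.exp(scores) / np.exp(scores).sum()  # multinomial posterior
    print(ovr_normalized)                            # the two generally disagree
    print(softmax)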
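
Second, the naming nuisance, side by side. In the current sklearn API
a bigger C weakens the penalty while a bigger alpha strengthens it:

    from sklearn.linear_model import Lasso, LogisticRegression

    logreg = LogisticRegression(C=100.0)  # large C     -> WEAK regularization
    lasso = Lasso(alpha=100.0)            # large alpha -> STRONG regularization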
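
Third, Gael's invariance argument, with a made-up squared loss (again
a toy objective, not sklearn code). If the data term is a sum over
samples, libsvm-style, the loss/penalty ratio grows with n, so a C
tuned on one dataset size does not transfer; if the data term is the
mean loss, a plug-in risk estimate, the ratio is stable:

    import numpy as np

    def objectives(residuals, w, C, alpha):
        sum_form = C * np.sum(residuals ** 2) + 0.5 * np.dot(w, w)  # libsvm-style
        mean_form = np.mean(residuals ** 2) + alpha * np.dot(w, w)  # risk-style
        return sum_form, mean_form

    rng = np.random.RandomState(0)
    w = rng.randn(5)
    small = rng.randn(100)    # residuals on a dataset with n = 100
    big = np.tile(small, 10)  # "same source/feed", ten times more samples

    print(objectives(small, w, C=1.0, alpha=0.1))
    print(objectives(big, w, C=1.0, alpha=0.1))
    # the loss term in sum_form grows 10x; mean_form is unchanged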
