I've stayed quiet in this discussion because I was busy elsewhere. The
good thing is that it has allowed me to hear the points of view of
different people. Here is mine.

First, the decision we took can be undone. It is not final, and it
should be taken so as to make our users' lives easiest, i.e. to make
things work as often as possible.

In my opinion, the way to do this is indeed to have a loss that is
invariant to the number of samples, i.e. _not_ to follow libsvm.

I'd like to stress that I don't think that following libsvm is much of a
goal per se. I understand that it makes the life of someone like James
easier, because he knows libsvm well and can relate to it. But
libsvm/liblinear are not gold standards, and there are many things that
I'd like to change in them if I could. For instance liblinear penalizes
the intercept, which, from my point of view seems nonsensical. Libsvm
and liblinear do not agree on whether multi-class should be done with one
versus rest or one versus one. Yesterday, during a meeting with Francis
Bach, he started bashing liblinear, saying that they had made the wrong
choice (I actually don't remember why). My point is: implementations will
never be perfect (ours won't either), we need to feel free to improve
them by taking some liberties.

Actually, if we are going to debate the exact value that the parameter
should take, let me tell you my point of view from an abstract,
user-centric perspective: it makes no sense that when I use logistic
regression, bigger C means less regularization, whereas when I use lasso,
bigger alpha means more regularization. As someone who has spent a little
while doing statistical learning, I understand the reasons behind this, but
it is really a nuisance for non-experts. All this to say that we should
make the right decision _regardless_ of what libsvm and liblinear do.
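To make the inconsistency concrete, here is a toy sketch (not scikit-learn's actual objectives; the loss and penalty values are made up for illustration) of the two conventions:

```python
# Two candidate weight settings: one fits the data well but has a large
# norm, one is heavily shrunk but fits poorly.
candidates = {
    "good_fit_big_w": (0.1, 4.0),   # (data_loss, penalty)
    "poor_fit_small_w": (2.0, 0.2),
}

def argmin_svm_style(C):
    # SVM/logreg convention: minimize C * data_loss + penalty.
    # Larger C makes the data term dominate -> LESS regularization.
    return min(candidates, key=lambda k: C * candidates[k][0] + candidates[k][1])

def argmin_lasso_style(alpha):
    # Lasso convention: minimize data_loss + alpha * penalty.
    # Larger alpha makes the penalty dominate -> MORE regularization.
    return min(candidates, key=lambda k: candidates[k][0] + alpha * candidates[k][1])

# The knobs turn in opposite directions:
assert argmin_svm_style(10.0) == "good_fit_big_w"      # big C: fit the data
assert argmin_svm_style(0.1) == "poor_fit_small_w"     # small C: shrink
assert argmin_lasso_style(0.1) == "good_fit_big_w"     # small alpha: fit the data
assert argmin_lasso_style(10.0) == "poor_fit_small_w"  # big alpha: shrink
```

So a user has to remember that big C and small alpha play the same role, which is exactly the kind of trap a non-expert falls into.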

I believe that the right choice is to have the ratio between the loss and
the penalization be invariant to the number of samples. From a theoretical
perspective, I believe that this is the case because the loss is the
plugin estimate of a risk. Such an estimate should not grow with the number
of samples. From a practical point of view, I believe that this is the
right choice because if I learn to set C on one dataset, and you give me a
new dataset saying it comes from the same source/feed, I should be able
to use the same C. In practice, the reason why Alex found this problem
is that, on real-life data, he had difficulties setting C.
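To make the invariance argument concrete, here is a toy sketch (not scikit-learn's or libsvm's actual code; a squared-hinge loss and random data are assumed for illustration) of the difference between summing and averaging the data term:

```python
import numpy as np

def objective_sum(w, X, y, C):
    # libsvm-style objective: the data term is a SUM over samples,
    # so it grows with the dataset size.
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return C * np.sum(margins ** 2) + 0.5 * w @ w

def objective_mean(w, X, y, C):
    # Sample-size-invariant objective: the data term is a MEAN,
    # i.e. a plugin estimate of the risk, which does not grow with n.
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return C * np.mean(margins ** 2) + 0.5 * w @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))
w = np.array([0.5, -0.5, 0.2])

# Feeding the same data twice changes the summed objective but not the
# averaged one, so a C tuned on one dataset transfers to a larger one
# drawn from the same source.
X2, y2 = np.vstack([X, X]), np.concatenate([y, y])
assert np.isclose(objective_mean(w, X2, y2, 1.0), objective_mean(w, X, y, 1.0))
assert objective_sum(w, X2, y2, 1.0) > objective_sum(w, X, y, 1.0)
```

With the summed convention, the trade-off between loss and penalty shifts every time the dataset grows, which is why a C tuned on one sample of a feed stops working on the next.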

That said, I agree with James that the docs should be much more
explicit about what is going on, and about how what we have differs from
libsvm. To come back to the initial discussion: it is entitled
"documentation inaccuracy", and I think that this is a fair criticism and
summary of the problem.

My 2 cents,

Gael

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
