I've stayed quiet in this discussion because I was busy elsewhere. The good thing is that it has allowed me to hear the points of view of different people. Here is mine.
First, the decision we took can be undone. It is not final, and the way it should be taken is to make our users' lives easiest, i.e. to make things work as often as possible. In my opinion, the way to do this is indeed to have a loss that is invariant to the number of samples, i.e. _not_ to follow libsvm.

I'd like to stress that I don't think that following libsvm is much of a goal per se. I understand that it makes the life of someone like James easier, because he knows libsvm well and can relate to it. But libsvm and liblinear are not gold standards, and there are many things that I would like to change in them if I could. For instance, liblinear penalizes the intercept, which, from my point of view, seems nonsensical. Libsvm and liblinear do not agree on whether multi-class should be done with one-versus-rest or one-versus-one. Yesterday, during a meeting with Francis Bach, he started bashing liblinear, saying that they had made the wrong choice (I actually don't remember why). My point is: implementations will never be perfect (ours won't either), and we need to feel free to improve on them by taking some liberties.

Actually, if we are going to debate the exact value that the parameter should take, let me give you my point of view from an abstract, user-centric angle: it is meaningless that when I use logistic regression, a bigger C means less regularization, whereas when I use the lasso, a bigger alpha means more regularization. As someone who has spent a little while doing statistical learning, I understand the reasons behind this, but it is really a nuisance for non-experts.

All this to say that we should make the right decision _regardless_ of what libsvm and liblinear do. I believe that the right choice is to have a ratio between the loss and the penalization that is invariant to the number of samples. From a theoretical perspective, I believe this is the right choice because the loss is a plug-in estimate of a risk, and such an estimate should not grow with the number of samples. From a practical point of view, I believe this is the right choice because if I learn to set C on a dataset, and you give me a new dataset saying it comes from the same source/feed, I should be able to reuse the same C. In practice, the reason Alex found this problem is that on real-life data he had difficulties setting C.

That said, I agree with James that the docs should be much more explicit about what is going on, and how what we have differs from libsvm. To come back to the initial discussion, it is entitled "documentation inaccuracy", and I think that this is a fair criticism and summary of the problem.

My 2 cents,

Gael
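P.S. To make the sample-invariance argument concrete, here is a minimal sketch in plain numpy (made-up data and a fixed weight vector, purely illustrative and not scikit-learn's actual implementation). It compares a libsvm-style summed hinge loss with a mean loss (the plug-in risk estimate) on two datasets of different sizes drawn from the same source:

    # Illustration: a summed loss breaks the loss/penalty balance when
    # n_samples changes, while a mean (plug-in risk estimate) does not.
    import numpy as np

    rng = np.random.RandomState(0)

    def hinge_losses(w, X, y):
        """Per-sample hinge loss for a linear model with weights w."""
        return np.maximum(0.0, 1.0 - y * X.dot(w))

    def objective(w, X, y, C, normalize):
        """SVM-style objective: penalty + C * (sum or mean of hinge losses)."""
        data_term = hinge_losses(w, X, y)
        data_term = data_term.mean() if normalize else data_term.sum()
        return 0.5 * w.dot(w) + C * data_term

    def make_data(n):
        # Two datasets "from the same source", differing only in size.
        X = rng.randn(n, 2)
        y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.randn(n))
        return X, y

    w = np.array([1.0, -2.0])          # fixed, arbitrary weight vector
    X_small, y_small = make_data(100)
    X_large, y_large = make_data(1000)

    C = 1.0
    for normalize in (False, True):
        obj_small = objective(w, X_small, y_small, C, normalize)
        obj_large = objective(w, X_large, y_large, C, normalize)
        label = "mean loss (n-invariant)" if normalize else "summed loss (libsvm-style)"
        print(f"{label}: small={obj_small:.1f}, large={obj_large:.1f}")

With the summed loss, the data term grows roughly tenfold when the dataset does, so the same C gives a very different loss/penalty trade-off; with the mean loss, the two objectives stay comparable, which is exactly why a C tuned on one dataset can be reused on another from the same feed.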
