On Thu, Mar 22, 2012 at 2:11 AM, Olivier Grisel
<[email protected]> wrote:

> Something that bothers me though, is that with libsvm, C=1 or C=10
> seems to be a reasonable default that works well both for datasets of
> size n_samples=100 and n_samples=10000 (by playing with the range of
> datasets available in the scikit).  On the other hand alpha would have
> to be grid searched systematically:
>
> It is also my gut feeling that dividing the regularization term by
> n_samples makes the optimal value *more* dependent on the dataset size
> rather than the opposite. That might be the reason why C is not scaled
> in the SVM literature. Of course I might be wrong as I have not done
> any kind of systematic cross-datasets analysis.

There is an important factor to consider as far as the dependency
of the optimal C on the sample size is concerned:

When the sample size is small the variance in model selection
is higher; thus a higher level of regularization is required.

This means a lower scaled_C is required in

Penalization + scaled_C * TotalLoss / n_samples

Then in

unscaled_C = (scaled_C / n_samples)

both the numerator and the denominator tend to decrease together
as the sample gets smaller, and this explains why the optimal
unscaled_C tends to remain constant with respect to sample size.
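
As a quick sanity check of the relation above, here is a minimal sketch
(not scikit-learn or libsvm code). It assumes a hinge loss and an L2
penalty purely for illustration; the helper names and the random data are
hypothetical. It only shows that the two parametrizations describe the
same objective once unscaled_C = scaled_C / n_samples.

```python
import numpy as np


def total_hinge_loss(w, X, y):
    # Sum (not mean) of the hinge losses over all samples.
    margins = y * (X @ w)
    return np.maximum(0.0, 1.0 - margins).sum()


def penalization(w):
    # L2 penalty, assumed here only for illustration.
    return 0.5 * np.dot(w, w)


def objective_scaled(w, X, y, scaled_C):
    # Penalization + scaled_C * TotalLoss / n_samples
    n_samples = X.shape[0]
    return penalization(w) + scaled_C * total_hinge_loss(w, X, y) / n_samples


def objective_unscaled(w, X, y, unscaled_C):
    # Penalization + unscaled_C * TotalLoss
    return penalization(w) + unscaled_C * total_hinge_loss(w, X, y)


# Hypothetical toy data, only used to evaluate both objectives.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(rng.normal(size=200) > 0, 1.0, -1.0)
w = rng.normal(size=5)

scaled_C = 10.0
unscaled_C = scaled_C / X.shape[0]

obj_scaled = objective_scaled(w, X, y, scaled_C)
obj_unscaled = objective_unscaled(w, X, y, unscaled_C)
print(obj_scaled, obj_unscaled)
```

The two printed values agree up to floating point rounding, so the
question is only which of the two constants should be exposed in the
interface.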

That said, I'm +1 on always using scaled_C in the interface,
for two reasons:

1) the objective function is formally and conceptually independent
    of the sample size (as it should be, IMHO).

2) using the optimal scaled_C calculated on a smaller sample
    (the CV sample size) when fitting on the full training set
    means using a scaled_C slightly biased toward a more parsimonious
    model (higher model bias). And that is what is often suggested.
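
To make point 2 concrete, here is a small arithmetic sketch. It assumes,
as argued above, that the optimal unscaled_C is roughly constant across
sample sizes; the sample sizes, the 5-fold split and the value 1.0 are
hypothetical and chosen only for illustration.

```python
def scaled_from_unscaled(unscaled_C, n_samples):
    # Inverse of unscaled_C = scaled_C / n_samples.
    return unscaled_C * n_samples


def unscaled_from_scaled(scaled_C, n_samples):
    # unscaled_C = scaled_C / n_samples
    return scaled_C / n_samples


n_train = 1000                      # full training set (hypothetical)
n_folds = 5                         # 5-fold CV (hypothetical)
n_cv = n_train * (n_folds - 1) // n_folds   # samples used per CV fit

# Assumption: the optimal unscaled_C does not depend on the sample size.
optimal_unscaled_C = 1.0

# scaled_C that cross-validation would select on the smaller CV sample.
scaled_C_cv = scaled_from_unscaled(optimal_unscaled_C, n_cv)

# Effective unscaled_C when that scaled_C is reused on the full train set.
implied_unscaled_C = unscaled_from_scaled(scaled_C_cv, n_train)

print(n_cv, scaled_C_cv, implied_unscaled_C)
```

Under these assumptions the implied unscaled_C on the full training set
is (n_folds - 1) / n_folds times the assumed optimum, i.e. the loss term
gets slightly less weight and the model is regularized a bit more.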

Paolo
