Thanks a lot for this detailed answer!
Kind regards,
Kevin

On 14/03/2014 16:37, Olivier Grisel wrote:
> 2014-03-14 15:34 GMT+01:00 Kevin Keraudren <kevin.keraudre...@imperial.ac.uk>:
>> Hi,
>>
>> I have a question about the range of my input data for SVM or
>> Random Forest classification:
>> I normalise my input vectors so that their Euclidean norm is one, for
>> instance to limit the influence of the image size or intensity contrast.
>> I then got into the habit of multiplying them by a factor of 1000
>> so that I have values between 0 and 1000 instead of 0 and 1, and thus
>> fewer values "close to zero". I guess it does not hurt to do so, but
>> would you know if it is useful? Do SVMs and Random Forests already do
>> some normalisation before starting to learn from the data?
> Random Forests (and decision tree-based models in general) are
> scale-independent.
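
A minimal sketch of the scale independence described above, assuming a recent scikit-learn and NumPy. The toy data, the variable names and the choice of 10 trees are illustrative assumptions, not part of the original discussion. It fits two forests with the same random seed, one on the raw features and one on the features multiplied by 1000, and compares them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: features on a 0.01 grid in [0, 1), binary labels.
X = rng.integers(0, 100, size=(200, 4)) / 100.0
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Same seed for both forests, so bootstrap samples and feature draws match.
forest_small = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
forest_large = RandomForestClassifier(n_estimators=10, random_state=0).fit(X * 1000, y)

# Each pair of trees should split on the same features in the same order;
# only the thresholds are rescaled.
same_splits = all(
    np.array_equal(a.tree_.feature, b.tree_.feature)
    for a, b in zip(forest_small.estimators_, forest_large.estimators_)
)
same_predictions = np.array_equal(
    forest_small.predict(X), forest_large.predict(X * 1000)
)
print("same split features:", same_splits)
print("same predictions:", same_predictions)
```

Splits depend only on the ordering of the values within each feature, which a multiplication by a positive constant leaves unchanged.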
>
> SVMs are very sensitive to scaling in the sense that all features
> should vary in the same range. The actual width of the range should
> not matter much as long as it does not cause numerical stability
> issues (both the 0-1 range and the 0-1000 range should work) and
> you grid search hyperparameters such as C and gamma for their optimal
> values:
>
> http://scikit-learn.org/stable/model_selection.html
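
The grid search mentioned above could look roughly like the following sketch. It assumes a recent scikit-learn, where GridSearchCV lives in sklearn.model_selection. The synthetic dataset, the step names "scale" and "svm", and the candidate values for C and gamma are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the real feature vectors.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Putting the scaler inside the pipeline means it is refit on each
# training fold, so no information leaks from the validation folds.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Illustrative grid only; suitable ranges depend on the data.
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [1e-3, 1e-2, 1e-1, 1],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

If the features are rescaled, the best gamma for an RBF kernel shifts accordingly, which is why the search should be rerun whenever the scaling changes.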
>
> You can use sklearn.preprocessing.StandardScaler to center the data
> (so that each feature has a mean of 0) and scale each feature to a
> standard deviation of 1. Scaling between 0 and 1 works well too; this
> is implemented by MinMaxScaler. More discussion in the doc:
>
> http://scikit-learn.org/stable/modules/preprocessing.html
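
As a small illustration of the two scalers named above, the sketch below applies each to made-up data in the 0-1000 range and checks the resulting per-feature statistics. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features spread over the 0-1000 range.
X = rng.uniform(0, 1000, size=(100, 3))

# StandardScaler: per-feature mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: per-feature minimum 0 and maximum 1 (default feature_range).
X_mm = MinMaxScaler().fit_transform(X)

print("means:", X_std.mean(axis=0))
print("stds:", X_std.std(axis=0))
print("mins:", X_mm.min(axis=0))
print("maxs:", X_mm.max(axis=0))
```

In practice the scaler should be fit on the training data only and then applied to the test data with transform.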
>
> I don't see any reason why the 0-1000 range would work better than the
> 0-1 range.
>
>> I have a similar question about Random Forests for regression: how is
>> the minimal MSE required for a split defined? Here again, if I scale my
>> input by a factor of 1000, should I expect the resulting trees to be
>> different (excluding the random aspect of Random Forests)?
> The decision to stop splitting in a tree is controlled by:
>
> - max_depth
> - min_samples_leaf
> - min_samples_split
>
> Otherwise, the regression trees are fully developed.
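
The three stopping parameters listed above can be tried out with a sketch like this one, assuming a recent scikit-learn. The synthetic regression data and the specific parameter values are illustrative assumptions. It compares a forest grown with the defaults against one with all three limits set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical noisy regression problem.
X = rng.uniform(0, 1, size=(200, 3))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)

# Defaults: no depth limit, so each tree is grown until its leaves are pure
# or too small to split.
full = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

# Explicit stopping criteria (values chosen arbitrarily for illustration).
limited = RandomForestRegressor(
    n_estimators=5,
    max_depth=3,
    min_samples_leaf=5,
    min_samples_split=10,
    random_state=0,
).fit(X, y)

full_depths = [t.tree_.max_depth for t in full.estimators_]
limited_depths = [t.tree_.max_depth for t in limited.estimators_]
full_nodes = sum(t.tree_.node_count for t in full.estimators_)
limited_nodes = sum(t.tree_.node_count for t in limited.estimators_)

print("depths with defaults:", full_depths)
print("depths with limits:", limited_depths)
print("total nodes:", full_nodes, "vs", limited_nodes)
```

None of these criteria refers to the scale of the input features, which is consistent with the trees not changing when the inputs are multiplied by 1000.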
>


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general