thanks lars.

this would mean that any tree-based model could give different results
depending on how the data is preprocessed, right?
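
a minimal sketch of the check lars suggests below, on synthetic data. the
1e-7 threshold is taken from his recollection of the tree learner and is not
verified against the scikit-learn source; the helper name
n_candidate_splits and the toy data are made up for illustration. it compares
the count of usable gaps per feature before and after standardizing, where
feature 0 is given an extremely small variance on purpose:

```python
import numpy as np


def n_candidate_splits(x, threshold=1e-7):
    """Count gaps between consecutive sorted values that are >= threshold.

    `threshold` mirrors the 1e-7 cutoff mentioned in the thread (assumed,
    not checked against the library source).
    """
    return int(np.sum(np.diff(np.sort(x)) >= threshold))


rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))   # small-sample toy data
X[:, 0] *= 1e-9                # feature 0: extremely small variance

# Plain standardization: subtract the column mean, divide by the column std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

raw = [n_candidate_splits(X[:, j]) for j in range(X.shape[1])]
std = [n_candidate_splits(X_std[:, j]) for j in range(X_std.shape[1])]

for j, (r, s) in enumerate(zip(raw, std)):
    print("feature %d: raw=%d standardized=%d" % (j, r, s))
```

if the cutoff works as described, feature 0 offers no usable split on the raw
data but does after standardizing, so the two fits would see different
candidate features.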

cheers,

satra

On Sun, Mar 16, 2014 at 3:37 PM, Olivier Grisel <olivier.gri...@ensta.org> wrote:

> 2014-03-16 0:23 GMT+01:00 Lars Buitinck <larsm...@gmail.com>:
> > 2014-03-15 21:53 GMT+01:00 Satrajit Ghosh <sa...@mit.edu>:
> >> in many cases with fat data (small samples<50 x many features>100000) i have
> >> found that standardizing helps quite a bit in case of extra trees. i still
> >> don't have a good understanding as to why this is the case. it could simply
> >> be small sample bias that i am seeing. but extra trees are also supposed to
> >> be resilient to overfitting.
> >>
> >> any thoughts?
> >
> > IIRC the scikit-learn tree learner discards a candidate split when the
> > difference between samples along the feature under consideration is
> > less than 1e-7. When you standardize, the learner might see a
> > different set of potential splits, and in particular, features with
> > extremely small variance wouldn't yield a split at all without
> > standardizing.
> >
> > I think you should be able to check the number of candidate splits
> > for a feature j with
> >
> >     np.sum(np.diff(np.sort(X[:, j])) >= 1e-7)
>
> That might be a better explanation as standardizing is a linear
> transform (actually affine but the intercept disappears in the
> subtraction) so should not impact the uniform sampling outcome.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
