Re: [Scikit-learn-general] Dealing with Categorical Variable in Random Forest

Michael Eickenberg Wed, 29 Oct 2014 09:19:45 -0700

Hi Xin,

As far as I know, the only workarounds right now are one-hot encoding or
representing your categories as integers. The former enlarges your feature
space and can introduce a bias if different categorical features take
different numbers of values: a feature with more values produces more
columns, so it is selected disproportionately often. The latter avoids that
problem but imposes an arbitrary order on the categories. Since each split
is a binary threshold test, a tree may need several splits, and hence some
depth, before it can single out an individual category.
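
To make the trade-off between the two encodings concrete, here is a rough
sketch in plain Python. It is only an illustration under stated assumptions:
the toy data and the helper names (integer_encode, one_hot_encode,
splits_to_isolate) are made up for this example and are not part of
scikit-learn.

```python
# Illustrative sketch only: toy data and hypothetical helper names,
# not scikit-learn API.

def integer_encode(values):
    """Map each category to an integer code (alphabetical, hence arbitrary)."""
    categories = sorted(set(values))
    code_of = {cat: i for i, cat in enumerate(categories)}
    return [code_of[v] for v in values], categories


def one_hot_encode(values):
    """Give each category its own 0/1 column."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


def splits_to_isolate(code, n_categories):
    """Count the threshold tests (x <= t) needed to single out one integer code."""
    needs_lower_cut = code > 0                  # smaller codes must be cut off
    needs_upper_cut = code < n_categories - 1   # larger codes must be cut off
    return int(needs_lower_cut) + int(needs_upper_cut)


colors = ["red", "green", "blue", "green", "red"]   # 3 distinct values
cities = ["Paris", "Rome", "Oslo", "Bern", "Lima"]  # 5 distinct values

color_codes, color_cats = integer_encode(colors)
color_onehot, _ = one_hot_encode(colors)
city_onehot, _ = one_hot_encode(cities)

# One-hot: the feature with more values contributes more columns.
n_color_cols = len(color_onehot[0])
n_city_cols = len(city_onehot[0])
print("one-hot columns:", n_color_cols, "for colors,", n_city_cols, "for cities")

# Integer codes: a category in the middle of the range needs two threshold
# tests, whereas its one-hot column would separate it with a single test.
green_code = color_cats.index("green")
green_splits = splits_to_isolate(green_code, len(color_cats))
print("threshold tests to isolate 'green':", green_splits)
```

In this toy case the five-valued feature yields five one-hot columns against
three for the three-valued one, and the integer-coded middle category needs
two threshold tests where a one-hot column would need one.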


I cannot comment on future developments, but I have the feeling that better
treatment of categorical features may be planned :)

Michael

On Wed, Oct 29, 2014 at 5:09 PM, Xin Shuai <[email protected]> wrote:

> Hi,
>  I'm a fan of Scikit-learn and it is my favorite ML package.
>  However, I found that this package DOES NOT handle categorical variables
> in tree-based methods, so I need to convert categorical variables into
> dummy variables before I can use a tree method. This is counterintuitive
> given how the original decision tree method works.
> Is any improvement planned?
> --
> Xin(David) Shuai
> PhD of Complex System in School of Informatics & Computing
> Indiana University Bloomington
> 812-606-8969
>
> The way to success is to do as much as important things, and as less as
> unimportant things, as you can...
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>

