Hi Xin,

As far as I know, the only workarounds right now are one-hot encoding or representing the categories as integers. The former enlarges your feature space and can introduce bias when categorical features take different numbers of values: a feature with more values gets more columns, so it tends to be selected disproportionately often. The latter avoids that problem, but because each split is a binary threshold on the integer code, a tree may need several levels of splits before it can separate one category from the others.
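To make the depth point concrete, here is a minimal sketch of both workarounds. The toy data is made up for illustration, and it assumes a scikit-learn version that ships `OrdinalEncoder` and accepts string categories in `OneHotEncoder` (both much newer than this thread). The target is 1 only for the category "b", which ends up in the middle of the integer ordering, so a single threshold cannot isolate it.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (invented for this example): one categorical feature with four
# values, repeated five times. The target is 1 only for category "b".
categories = np.array([["a"], ["b"], ["c"], ["d"]] * 5)
y = (categories.ravel() == "b").astype(int)

# Workaround 1: one-hot encoding gives one 0/1 column per category.
X_onehot = OneHotEncoder().fit_transform(categories).toarray()

# Workaround 2: integer codes (a=0, b=1, c=2, d=3) in a single column.
X_int = OrdinalEncoder().fit_transform(categories)


def train_accuracy(X, depth):
    """Fit a tree of the given maximum depth and score it on the training data."""
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    return tree.fit(X, y).score(X, y)


acc_onehot_d1 = train_accuracy(X_onehot, depth=1)
acc_int_d1 = train_accuracy(X_int, depth=1)
acc_int_d2 = train_accuracy(X_int, depth=2)

print("one-hot columns:", X_onehot.shape[1])
print("one-hot, depth 1:", acc_onehot_d1)
print("integer, depth 1:", acc_int_d1)
print("integer, depth 2:", acc_int_d2)
```

With one-hot columns, a single split on the "b" column separates the classes. With integer codes, the tree needs two thresholds (one on each side of the code for "b"), so it only fits the data once a second level of splits is allowed.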
I cannot comment on future developments, but I have the feeling that better treatment of categorical features may be in the plans :)

Michael

On Wed, Oct 29, 2014 at 5:09 PM, Xin Shuai <[email protected]> wrote:
> Hi,
> I'm a fan of Scikit-learn and it is my favorite ML package.
> However, I found this package DOES NOT deal with categorical variable for
> tree-based method. So I need to convert categorical variable into dummy
> variable before I can use tree method. Actually, this is counterintuitive
> to the original decision tree method.
> Any improvement on that?
> --
> Xin(David) Shuai
> PhD of Complex System in School of Informatics & Computing
> Indiana University Bloomington
> 812-606-8969
>
> The way to success is to do as much as important things, and as less as
> unimportant things, as you can...
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
