Hi Yang Li,

I have to agree with you. Bitsets and/or one-hot encoding are just hacks that 
should not be necessary for decision tree learners.

There is a work-in-progress implementation for natural handling of categorical 
features in trees; please take a look at 
https://github.com/scikit-learn/scikit-learn/pull/4899

Cheers!

--
Julio

> On 4 Jan 2018, at 9:06, 李扬 <sky188133...@163.com> wrote:
> 
> Dear J.B.,
> 
> Thanks for your advice!
> 
> Yes, I have considered using a bitstring or a sequence number, but the problem 
> is the algorithm, not the representation of the categorical data.
> Take the regression tree as an example: the algorithm in sklearn finds a split 
> value of the feature and chooses the best split by computing the minimal 
> impurity of the child nodes.
> However, finding such a split value for a categorical feature is not very 
> meaningful even if you represent it as a continuous value, and the resulting 
> split partly depends on how you permute the values of the categorical feature, 
> which is not very persuasive.
> Instead, in the CART algorithm, you should separate the categories of the 
> feature into two sets and compute the impurity of those two sets, then find 
> the best separation with the minimal impurity.
> Obviously, this separation cannot be done by the current algorithm, which 
> simply applies the threshold split used for continuous values.
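> 
> To make the idea concrete, here is a toy sketch of that kind of two-set search 
> for a regression split. The helper names (weighted_variance, 
> best_categorical_split) and the toy data are invented for illustration; this 
> is not scikit-learn's implementation:
> 
> from itertools import combinations
> import numpy as np
> 
> def weighted_variance(y_left, y_right):
>     # Impurity of a binary partition: size-weighted variance of each side.
>     n = len(y_left) + len(y_right)
>     return (len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)) / n
> 
> def best_categorical_split(x, y):
>     # Try every way of sending a non-empty proper subset of categories
>     # to the left child; the complement goes to the right child.
>     categories = np.unique(x)
>     best_subset, best_impurity = None, np.inf
>     # A subset and its complement define the same split, so half is enough.
>     for k in range(1, len(categories) // 2 + 1):
>         for subset in combinations(categories, k):
>             mask = np.isin(x, subset)
>             impurity = weighted_variance(y[mask], y[~mask])
>             if impurity < best_impurity:
>                 best_subset, best_impurity = set(subset), impurity
>     return best_subset, best_impurity
> 
> x = np.array(["Facebook", "Twitter", "Google", "Twitter", "Facebook"])
> y = np.array([1.0, 5.0, 5.2, 4.8, 1.1])
> print(best_categorical_split(x, y))  # best left-child categories and their impurity
> 
> Note that no ordering of the categories is assumed anywhere, which is exactly 
> what a threshold split cannot offer.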
> 
> One more possible shortcoming is that a categorical feature cannot be properly 
> visualized: when drawing the tree graph, it is hard to read any information 
> from a categorical feature node if it is just split at a threshold.
> 
> Thank you for your time!
> Best wishes.
> 
> 
> 
> 
> --
> 
> Yang Li  +86 188 1821 2371
> Shanghai Jiao Tong University
> School of Electronic Information and Electrical Engineering, F1203026
> 800 Dongchuan Road, Minhang District, Shanghai 200240
> 
> 
>  
> 
> At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" 
> <scikit-learn@python.org> wrote:
> Dear Yang Li,
> 
> > Neither the classificationTree nor the regressionTree supports categorical 
> > features. That means the decision tree model can only accept continuous 
> > features. 
> 
> Consider either manually encoding your categories in bitstrings (e.g., 
> "Facebook" = 001, "Twitter" = 010, "Google" = 100), or using OneHotEncoder to 
> do the same thing for you automatically.
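> 
> A quick sketch of the OneHotEncoder route, assuming a scikit-learn version 
> whose OneHotEncoder accepts string input directly (older releases need an 
> integer encoding first); the example data is made up:
> 
> import numpy as np
> from sklearn.preprocessing import OneHotEncoder
> 
> X = np.array([["Facebook"], ["Twitter"], ["Google"], ["Twitter"]])
> encoder = OneHotEncoder()
> X_encoded = encoder.fit_transform(X).toarray()  # sparse by default; densify to inspect
> print(encoder.categories_)  # learned category order
> print(X_encoded)            # one 0/1 indicator column per category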
> 
> Cheers,
> J.B.
> 
> 
>  
> 
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
