Hi Yang Li,

I have to agree with you. Bitsets and/or one-hot encoding are just hacks that should not be necessary for decision tree learners.
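For reference, the one-hot workaround discussed in this thread might look roughly like the sketch below. The toy data and the variable names (`X`, `y`, `model`, `pred`) are made up for illustration; only `OneHotEncoder`, `DecisionTreeRegressor` and `make_pipeline` come from scikit-learn.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one categorical feature, one numeric target.
X = np.array(
    [["Facebook"], ["Twitter"], ["Google"],
     ["Facebook"], ["Twitter"], ["Google"]],
    dtype=object,
)
y = np.array([1.0, 5.0, 9.0, 1.0, 5.0, 9.0])

# The encoder turns each category into its own 0/1 column, so every
# threshold split in the tree becomes "is this category vs. not".
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeRegressor(random_state=0),
)
model.fit(X, y)

pred = model.predict(np.array([["Twitter"]], dtype=object))
print(pred)
```

The tree still only sees numeric columns here; the encoding just hides the categorical nature of the feature from it, which is why it feels like a hack.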
There is some WIP on an implementation for natural handling of categorical features in trees: please take a look at https://github.com/scikit-learn/scikit-learn/pull/4899

Cheers!
--
Julio

> On 4 Jan 2018, at 9:06, 李扬 <sky188133...@163.com> wrote:
>
> Dear J.B.,
>
> Thanks for your advice!
>
> Yes, I have considered using bitstrings or sequence numbers, but the problem is the algorithm, not the representation of the categorical data.
> Take the regression tree as an example: the algorithm in sklearn finds a split value for the feature, and picks the best split by computing the minimal impurity of the child nodes.
> However, finding a split value for a categorical feature is not that meaningful even if you represent it as a continuous value, and the split result partially depends on how you permute the values of the categorical feature, which is not very convincing.
> Instead, in the CART algorithm, you should separate each category of the feature from the others and compute the impurity of the two sets, then pick the separation with the minimal impurity.
> Obviously, this separation process can't be carried out by the current algorithm, which simply uses the split method for continuous values.
>
> One more possible shortcoming is that the categorical feature can't be properly visualized. When drawing the tree graph, it's hard to get information from a categorical feature node when you just split it on a value.
>
> Thank you for your time!
> Best wishes.
>
> --
> Yang Li +86 188 1821 2371
> Shanghai Jiao Tong University
> School of Electronic, Information and Electrical Engineering F1203026
> 800 Dongchuan Road, Minhang District, Shanghai 200240
>
> At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" <scikit-learn@python.org> wrote:
> Dear Yang Li,
>
> > Neither the classificationTree nor the regressionTree supports categorical features.
> > That means the decision tree models can only accept continuous features.
>
> Consider either manually encoding your categories in bitstrings (e.g., "Facebook" = 001, "Twitter" = 010, "Google" = 100), or using OneHotEncoder to do the same thing for you automatically.
>
> Cheers,
> J.B.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
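The one-vs-rest separation described in the quoted message (put each category on one side, all the others on the other side, and keep the separation with the lowest impurity) can be sketched in plain Python. This is only an illustration under assumptions: the function name `best_one_vs_rest_split`, the toy data, and the choice of size-weighted variance as the regression impurity are all hypothetical, and this is not how scikit-learn's tree code is written.

```python
from statistics import pvariance


def best_one_vs_rest_split(categories, targets):
    """Return (category, weighted_impurity) for the best one-vs-rest split.

    Impurity of a candidate is the size-weighted population variance of
    the targets in the two child sets. Returns None if there is only one
    distinct category, since nothing can be separated.
    """
    n = len(targets)
    best = None
    for cat in sorted(set(categories)):
        left = [t for c, t in zip(categories, targets) if c == cat]
        right = [t for c, t in zip(categories, targets) if c != cat]
        if not right:
            continue
        impurity = (len(left) * pvariance(left)
                    + len(right) * pvariance(right)) / n
        if best is None or impurity < best[1]:
            best = (cat, impurity)
    return best


# Hypothetical toy data: "Google" rows have clearly different targets.
categories = ["Facebook", "Twitter", "Google",
              "Facebook", "Twitter", "Google"]
targets = [1.0, 1.2, 9.0, 0.8, 1.0, 9.2]

best = best_one_vs_rest_split(categories, targets)
print(best)
```

Unlike a threshold on integer codes, the result here does not depend on how the categories are numbered or ordered, because each candidate is defined by category membership alone.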