Your contribution would be very welcome; I think the current work has stalled.

On 01/04/2018 10:02 AM, Julio Antonio Soto de Vicente wrote:
Hi Yang Li,

I have to agree with you. Bitsets and/or one-hot encoding are just hacks that should not be necessary for decision tree learners.

There is some WIP on an implementation for natural handling of categorical features in trees: please take a look at https://github.com/scikit-learn/scikit-learn/pull/4899

Cheers!

--
Julio

On 4 Jan 2018, at 9:06, 李扬 (Yang Li) <sky188133...@163.com> wrote:

Dear J.B.,

Thanks for your advice!

Yes, I have considered using a bitstring or a sequence number, but the problem is the algorithm, not the representation of the categorical data. Take the regression tree as an example: the algorithm in sklearn finds a split value for the feature and chooses the best split by computing the minimal impurity of the child nodes. However, finding a split point for a categorical feature is not really meaningful even if you represent it as a continuous value, and the split result partially depends on how you permute the values of the categorical feature, which is not very persuasive. Instead, in the CART algorithm, *you should separate a subset of the categories in the feature from the others and compute the impurity of the two resulting sets, then find the separation with minimal impurity.* Obviously, this separation process cannot be performed by the current algorithm, which simply applies the split method for continuous values.
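For concreteness, the exhaustive CART-style search I mean could be sketched like this (a toy illustration with names of my own invention, not scikit-learn's internals):

```python
# Toy sketch of an exhaustive CART-style categorical split for a
# regression tree: try every binary partition of the categories and
# keep the one with minimal weighted child impurity (here: variance).
# This is an illustration only, not scikit-learn's implementation.
from itertools import combinations

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_categorical_split(xs, ys):
    """xs: categorical feature values, ys: regression targets."""
    cats = sorted(set(xs))
    best_impurity, best_left = float("inf"), None
    # Fixing cats[0] on the left side enumerates each binary
    # partition exactly once.
    for r in range(len(cats) - 1):
        for combo in combinations(cats[1:], r):
            left = {cats[0], *combo}
            ly = [y for x, y in zip(xs, ys) if x in left]
            ry = [y for x, y in zip(xs, ys) if x not in left]
            imp = (len(ly) * variance(ly) + len(ry) * variance(ry)) / len(ys)
            if imp < best_impurity:
                best_impurity, best_left = imp, left
    return best_left, best_impurity
```

With xs = ["a", "a", "b", "b", "c", "c"] and ys = [1, 1, 1, 1, 10, 10], the search puts {"a", "b"} on one side with zero weighted impurity, regardless of how the categories are ordered; note the number of partitions grows exponentially in the number of categories.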

One more possible shortcoming is that a categorical feature cannot be properly visualized: when drawing the tree graph, it is hard to read any information from a categorical feature node if you just split it at a threshold.

Thank you for your time!
Best wishes.




--
Best regards,

Yang Li  +86 188 1821 2371
Shanghai Jiao Tong University
School of Electronic, Information and Electrical Engineering F1203026
800 Dongchuan Road, Minhang District, Shanghai 200240



At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" <scikit-learn@python.org> wrote:

    Dear Yang Li,

    > Neither the classificationTree nor the regressionTree supports
    categorical features. That means decision tree models can only
    accept continuous features.

    Consider either manually encoding your categories in bitstrings
    (e.g., "Facebook" = 001, "Twitter" = 010, "Google" = 100), or
    using OneHotEncoder to do the same thing for you automatically.
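The manual encoding could look like this (plain Python; sklearn.preprocessing.OneHotEncoder produces the equivalent columns automatically, and its bit ordering may differ):

```python
# Manual one-hot/bitstring encoding of a categorical feature, as
# suggested above. The bit order is arbitrary; here "Facebook" maps
# to [1, 0, 0]. OneHotEncoder automates the same mapping.
categories = ["Facebook", "Twitter", "Google"]

def one_hot(value, categories):
    """Return a bit vector with a 1 in the slot for `value`."""
    return [1 if c == value else 0 for c in categories]

encoded = [one_hot(v, categories)
           for v in ["Facebook", "Twitter", "Google", "Twitter"]]
# Every encoded row has exactly one bit set.
```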

    Cheers,
    J.B.



_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


