Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

Sebastian Raschka Fri, 04 Oct 2019 12:48:06 -0700

As Nicolas said, the 0.5 threshold is just a workaround, but it does the right 
thing for one-hot encoded variables. You will find that the threshold is always 
at 0.5 for these variables. That is, the tree effectively applies the following 
conversion:


treat as car_Audi=1 if car_Audi >= 0.5
treat as car_Audi=0 if car_Audi < 0.5

or, it may be

treat as car_Audi=1 if car_Audi > 0.5
treat as car_Audi=0 if car_Audi <= 0.5

(I forget which one sklearn uses, but either way, it will be fine.)
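
A minimal sketch of how one could check this, assuming pandas and scikit-learn are installed. The toy data, the column name `car`, and the variable names are made up for illustration. As far as the fitted `tree_` attributes are documented, scikit-learn sends samples with `feature <= threshold` to the left child, which corresponds to the second variant above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up for illustration): predict income from car brand.
df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Toyota", "Audi"],
    "income": [10000, 9000, 12000, 10500, 9200, 11800],
})

# One-hot encode before fitting; the tree never sees the strings.
X = pd.get_dummies(df[["car"]], dtype=float)  # car_Audi, car_BMW, car_Toyota
y = df["income"]

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Internal (non-leaf) nodes have a left child; leaves are marked with -1.
is_split = tree.tree_.children_left != -1
split_thresholds = tree.tree_.threshold[is_split]
split_features = [X.columns[i] for i in tree.tree_.feature[is_split]]

for name, thr in zip(split_features, split_thresholds):
    # Samples with value <= threshold go to the left child.
    print(f"{name} <= {thr}")
```

Every printed split should sit at 0.5, since that is the midpoint between the only two values (0 and 1) a one-hot column can take.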

Best,
Sebastian


> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote:
> 
> 
>> But, decision tree is still mistaking one-hot-encoding as numerical input 
>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
> 
> You're not doing anything wrong, and neither is the tree. Trees don't support 
> categorical variables in sklearn, so everything is treated as numerical.
> 
> This is why we do one-hot-encoding: so that a set of numerical (one hot 
> encoded) features can be treated as if they were just one categorical feature.
> 
> 
> 
> Nicolas
> 
> On 10/4/19 2:01 PM, C W wrote:
>> Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, a typo 
>> on my part.
>> 
>> Looks like I did one-hot-encoding correctly. My new variable names are: 
>> car_Audi, car_BMW, etc.
>> 
>> But, decision tree is still mistaking one-hot-encoding as numerical input 
>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
>> 
>> Is there a good toy example on the sklearn website? I only see this: 
>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>> 
>> Thanks!
>> 
>> 
>> 
>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote:
>> Hi,
>> 
>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
>>> Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5
>> 
>> That's not a one-hot encoding, then.
>> 
>> For an Audi datapoint, it should be
>> 
>> BMW=0
>> Toyota=0
>> Audi=1
>> 
>> for BMW
>> 
>> BMW=1
>> Toyota=0
>> Audi=0
>> 
>> and for Toyota
>> 
>> BMW=0
>> Toyota=1
>> Audi=0
>> 
>> The split threshold should then be at 0.5 for any of these features.
>> 
>> Based on your email, I think you were assuming that the DT does the one-hot 
>> encoding internally, which it doesn't. In practice, it is hard to guess what 
>> is a nominal and what is an ordinal variable, so you have to do the one-hot 
>> encoding before you give the data to the decision tree.
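
To make the difference concrete, here is a small sketch comparing the integer codes from the original question with a true one-hot encoding. The toy incomes and the helper `split_thresholds` are hypothetical, introduced only for this illustration; `OneHotEncoder` and `DecisionTreeRegressor` are the regular scikit-learn classes.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Made-up toy data: car brand -> income.
cars = np.array([["BMW"], ["Toyota"], ["Audi"], ["BMW"], ["Toyota"], ["Audi"]])
income = np.array([10000, 9000, 12000, 10500, 9200, 11800])


def split_thresholds(X, y):
    """Fit a regression tree; return the sorted distinct split thresholds."""
    t = DecisionTreeRegressor(random_state=0).fit(X, y).tree_
    return sorted(set(t.threshold[t.children_left != -1].tolist()))


# Integer codes as in the original question: BMW=0, Toyota=1, Audi=2.
codes = {"BMW": 0, "Toyota": 1, "Audi": 2}
X_int = np.array([[codes[c]] for c in cars[:, 0]], dtype=float)

# True one-hot encoding: one 0/1 column per brand.
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(cars).toarray()

print(encoder.categories_[0])               # column order of the one-hot matrix
print(split_thresholds(X_int, income))      # splits fall between the codes
print(split_thresholds(X_onehot, income))   # splits on 0/1 columns
```

With the integer codes the tree treats the brand as one ordered number and cuts it at 0.5 and 1.5; with the one-hot columns each split asks a yes/no question about a single brand.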
>> 
>> Best,
>> Sebastian
>> 
>>> On Oct 4, 2019, at 11:48 AM, C W wrote:
>>> 
>>> I'm getting some funny results. I am fitting a regression decision tree; 
>>> the response variables are assigned to levels.
>>> 
>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
>>> Audi=2) as numerical values, not categories.
>>> 
>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does 
>>> sklearn know internally that 0 vs. 1 is categorical, not numerical? 
>>> 
>>> In R for instance, you do as.factor(), which explicitly states the data 
>>> type.
>>> 
>>> Thank you!
>>> 
>>> 
>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote:
>>> 
>>> 
>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>> 
>>>> 
>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote:
>>>> Thanks, Guillaume. 
>>>> Column transformer looks pretty neat. I've also heard though, this 
>>>> pipeline can be tedious to set up? Specifying what you want for every 
>>>> feature is a pain.
>>>> 
>>>> It would be interesting for us to know which part of the pipeline is 
>>>> tedious to set up, so we can see whether something can be improved there.
>>>> Do you mean that you would like to automatically detect the type of each 
>>>> feature (categorical/numerical) and apply a default encoder/scaler, as 
>>>> discussed here: 
>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>> 
>>>> IMO, from a user perspective, it would be cleaner in some cases, at the 
>>>> cost of blindly applying a black box, which might be dangerous.
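
For reference, a sketch of what such a pipeline can look like with a mixed table like the one in this thread. The extra row and all values are illustrative, and `make_column_selector` is an assumption about the installed version: it only exists in scikit-learn releases newer than this thread. Selecting every non-numeric column for one-hot encoding is exactly the kind of blind default discussed above, so treat it as a convenience rather than a recommendation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy rows modeled on the table in this thread (values are illustrative).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [30, 35, 50, 41],
    "Car": ["BMW", "Toyota", "Audi", "BMW"],
    "Attendance": ["Yes", "No", "Yes", "No"],
    "Income": [10000, 9000, 12000, 9500],
})
X, y = df.drop(columns="Income"), df["Income"]

preprocess = ColumnTransformer(
    transformers=[
        # Pick every non-numeric column automatically and one-hot encode it.
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude="number")),
    ],
    remainder="passthrough",  # numeric columns such as Age pass through as-is
)

# The tree only ever sees the encoded, fully numeric matrix.
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
print(model.predict(X.head(2)))
```

If the automatic selection is too coarse, the selector can be replaced by an explicit list of column names such as `["Gender", "Car", "Attendance"]`.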
>>> Also see 
>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>> which basically does that.
>>> 
>>> 
>>>>  
>>>> 
>>>> Javier,
>>>> Actually, you guessed right. My real data has only one numerical variable 
>>>> and looks more like this:
>>>> 
>>>> Gender  Date       Income  Car     Attendance
>>>> Male    2019/3/01  10000   BMW     Yes
>>>> Female  2019/5/02   9000   Toyota  No
>>>> Male    2019/7/15  12000   Audi    Yes
>>>> 
>>>> I am predicting income using all other categorical variables. Maybe it is 
>>>> catboost!
>>>> 
>>>> Thanks,
>>>> 
>>>> M
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López wrote:
>>>> If you have datasets with many categorical features, and perhaps many 
>>>> categories, the tools in sklearn are quite limited, 
>>>> but there are alternative implementations of boosted trees that are 
>>>> designed with categorical features in mind. Take a look
>>>> at catboost [1], which has an sklearn-compatible API.
>>>> 
>>>> J
>>>> 
>>>> [1] https://catboost.ai/
>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote:
>>>> Hello all,
>>>> I'm very confused. Can the decision tree module handle both continuous and 
>>>> categorical features in the dataset? In this case, it's just CART 
>>>> (Classification and Regression Trees).
>>>> 
>>>> For example,
>>>> Gender  Age  Income  Car     Attendance
>>>> Male    30   10000   BMW     Yes
>>>> Female  35    9000   Toyota  No
>>>> Male    50   12000   Audi    Yes
>>>> 
>>>> According to the documentation 
>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>> it cannot!
>>>> 
>>>> It says: "scikit-learn implementation does not support categorical 
>>>> variables for now". 
>>>> 
>>>> Is this true? If not, can someone point me to an example? If yes, what do 
>>>> people do?
>>>> 
>>>> Thank you very much!
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>> 
>>>> 
>>>> -- 
>>>> Guillaume Lemaitre
>>>> INRIA Saclay - Parietal team
>>>> Center for Data Science Paris-Saclay
>>>> https://glemaitre.github.io/
>>>> 
>>> 
>> 

_______________________________________________
scikit-learn mailing list
https://mail.python.org/mailman/listinfo/scikit-learn