Hi,

If you have the categorical feature "Car" as shown in your example, then once 
it is encoded as integers, it would effectively be something like

BMW=0
Toyota=1
Audi=2

Sure, the algorithm will execute just fine on the feature column with values in 
{0, 1, 2}. However, the problem is that it will come up with binary rules like 
x_i >= 0.5 and x_i >= 1.5. I.e., it will treat it as a continuous variable, 
which implies an ordering (BMW < Toyota < Audi) that does not exist.
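
To make the threshold behavior concrete, here is a minimal sketch. The labels 
in y are made up for illustration (they are not from your data); they are 
chosen so that Toyota differs from both other brands. Note that scikit-learn 
stores its rules in the form x_i <= threshold, but the idea is the same.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Integer codes from the mapping above: BMW=0, Toyota=1, Audi=2
X = np.array([[0], [1], [2], [0], [1], [2]])
# Hypothetical labels: Toyota (code 1) is the odd one out
y = np.array([1, 0, 1, 1, 0, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Internal nodes have a feature index >= 0; leaves are marked with -2
internal = tree.tree_.feature >= 0
thresholds = sorted(float(t) for t in tree.tree_.threshold[internal])
print(thresholds)  # midpoints between adjacent integer codes
```

Isolating the middle code (Toyota) takes two threshold splits here, because the 
tree can only cut the number line, not pick out a single category.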

What you can do is to encode this feature via one-hot encoding -- basically 
extend it into 2 (or 3) binary variables. This has its own problems: if you 
have a feature with many possible values, you will end up with a large number 
of binary variables, and they may dominate the resulting tree over the other 
feature variables.
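
A rough sketch of the one-hot route, using scikit-learn's OneHotEncoder. Again, 
the labels in y are made up for illustration. With the default settings you get 
3 binary columns; passing drop="first" to the encoder should give the 2-column 
variant instead.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

cars = np.array([["BMW"], ["Toyota"], ["Audi"],
                 ["BMW"], ["Toyota"], ["Audi"]])
# Hypothetical labels: Toyota is the odd one out
y = np.array([1, 0, 1, 1, 0, 1])

# One binary column per category; the result is sparse, so densify it
enc = OneHotEncoder()
X_onehot = enc.fit_transform(cars).toarray()
print(enc.categories_[0])  # column order of the binary variables
print(X_onehot.shape)

tree = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)

# Every split on a 0/1 column ends up at threshold 0.5,
# i.e. "is this category or not"
internal = tree.tree_.feature >= 0
split_thresholds = [float(t) for t in tree.tree_.threshold[internal]]
print(split_thresholds)
```

With the binary columns, a single split on the Toyota column is enough to 
separate the classes in this toy example.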

In any case, I guess this is what 

> "scikit-learn implementation does not support categorical variables for now". 


means ;).

Best,
Sebastian

> On Sep 13, 2019, at 9:38 PM, C W <tmrs...@gmail.com> wrote:
> 
> Hello all,
> I'm very confused. Can the decision tree module handle both continuous and 
> categorical features in the dataset? In this case, it's just CART 
> (Classification and Regression Trees).
> 
> For example,
> Gender Age Income  Car   Attendance
> Male     30   10000   BMW          Yes
> Female 35     9000  Toyota          No
> Male     50   12000    Audi           Yes
> 
> According to the documentation 
> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>  it can not! 
> 
> It says: "scikit-learn implementation does not support categorical variables 
> for now". 
> 
> Is this true? If not, can someone point me to an example? If yes, what do 
> people do?
> 
> Thank you very much!
> 
> 
> 
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
