As Nicolas said, the 0.5 threshold is just a workaround, but it does the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. That is, the tree effectively uses the following conversion:

    treat as car_Audi=1 if car_Audi >= 0.5
    treat as car_Audi=0 if car_Audi < 0.5

or, it may be

    treat as car_Audi=1 if car_Audi > 0.5
    treat as car_Audi=0 if car_Audi <= 0.5

(I forgot which one sklearn is using, but either way it will be fine.)

Best,
Sebastian

> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <nio...@gmail.com> wrote:
>
>> But, decision tree is still mistaking one-hot-encoding as numerical input
>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
>
> You're not doing anything wrong, and neither is the tree. Trees don't support
> categorical variables in sklearn, so everything is treated as numerical.
>
> This is why we do one-hot-encoding: so that a set of numerical (one hot
> encoded) features can be treated as if they were just one categorical feature.
>
> Nicolas
>
> On 10/4/19 2:01 PM, C W wrote:
>> Yes, you are right. It was 0.5 and 0.5 for split, not 1.5. So, typo on my
>> part.
>>
>> Looks like I did one-hot-encoding correctly. My new variable names are:
>> car_Audi, car_BMW, etc.
>>
>> But, decision tree is still mistaking one-hot-encoding as numerical input
>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
>>
>> Is there a good toy example on the sklearn website? I only see this:
>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>>
>> Thanks!
>>
>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>> Hi,
>>
>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1,
>>> Audi=2) as numerical values, not category. The tree splits at 0.5 and 1.5
>>
>> that's not a onehot encoding then.
>>
>> For an Audi datapoint, it should be
>>
>> BMW=0
>> Toyota=0
>> Audi=1
>>
>> for BMW
>>
>> BMW=1
>> Toyota=0
>> Audi=0
>>
>> and for Toyota
>>
>> BMW=0
>> Toyota=1
>> Audi=0
>>
>> The split threshold should then be at 0.5 for any of these features.
>>
>> Based on your email, I think you were assuming that the DT does the one-hot
>> encoding internally, which it doesn't. In practice, it is hard to guess what
>> is a nominal and what is an ordinal variable, so you have to do the onehot
>> encoding before you give the data to the decision tree.
>>
>> Best,
>> Sebastian
>>
>>> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
>>>
>>> I'm getting some funny results. I am doing a regression decision tree, the
>>> response variables are assigned to levels.
>>>
>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1,
>>> Audi=2) as numerical values, not category.
>>>
>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does
>>> the sklearn know internally 0 vs. 1 is categorical, not numerical?
>>>
>>> In R for instance, you do as.factor(), which explicitly states the data
>>> type.
>>>
>>> Thank you!
>>>
>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
>>>
>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>>
>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
>>>> Thanks, Guillaume.
>>>> Column transformer looks pretty neat. I've also heard though, this
>>>> pipeline can be tedious to set up? Specifying what you want for every
>>>> feature is a pain.
>>>>
>>>> It would be interesting for us which part of the pipeline is tedious to
>>>> set up to know if we can improve something there.
>>>> Do you mean that you would like to automatically detect which type of
>>>> feature (categorical/numerical) each column is and apply a
>>>> default encoder/scaling, such as discussed there:
>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>>
>>>> IMO, from a user perspective, it would be cleaner in some cases at the cost
>>>> of blindly applying a black box,
>>>> which might be dangerous.
>>> Also see
>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>> Which basically does that.
>>>
>>>>
>>>> Javier,
>>>> Actually, you guessed right. My real data has only one numerical variable,
>>>> looks more like this:
>>>>
>>>> Gender  Date       Income  Car     Attendance
>>>> Male    2019/3/01  10000   BMW     Yes
>>>> Female  2019/5/02  9000    Toyota  No
>>>> Male    2019/7/15  12000   Audi    Yes
>>>>
>>>> I am predicting income using all other categorical variables. Maybe it is
>>>> catboost!
>>>>
>>>> Thanks,
>>>>
>>>> M
>>>>
>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
>>>> If you have datasets with many categorical features, and perhaps many
>>>> categories, the tools in sklearn are quite limited,
>>>> but there are alternative implementations of boosted trees that are
>>>> designed with categorical features in mind. Take a look
>>>> at catboost [1], which has an sklearn-compatible API.
>>>>
>>>> J
>>>>
>>>> [1] https://catboost.ai/
>>>>
>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
>>>> Hello all,
>>>> I'm very confused. Can the decision tree module handle both continuous and
>>>> categorical features in the dataset?
>>>> In this case, it's just CART
>>>> (Classification and Regression Trees).
>>>>
>>>> For example,
>>>> Gender  Age  Income  Car     Attendance
>>>> Male    30   10000   BMW     Yes
>>>> Female  35   9000    Toyota  No
>>>> Male    50   12000   Audi    Yes
>>>>
>>>> According to the documentation
>>>> <https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart>,
>>>> it can not!
>>>>
>>>> It says: "scikit-learn implementation does not support categorical
>>>> variables for now".
>>>>
>>>> Is this true? If not, can someone point me to an example? If yes, what do
>>>> people do?
>>>>
>>>> Thank you very much!
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn@python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>> --
>>>> Guillaume Lemaitre
>>>> INRIA Saclay - Parietal team
>>>> Center for Data Science Paris-Saclay
>>>> https://glemaitre.github.io/
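
For illustration, here is a minimal sketch of the point made at the top of this message: one-hot encode the car column yourself, fit a regression tree, and look at the thresholds it picks. It is not code from the thread. The six income values are made up (loosely following the toy table quoted above), and the names cars, income, feature_names, split_thresholds and split_features are hypothetical. It assumes a scikit-learn version that provides export_text (0.21 or later).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up toy data for illustration: one categorical feature, numeric target.
cars = np.array([["BMW"], ["Toyota"], ["Audi"], ["BMW"], ["Toyota"], ["Audi"]])
income = np.array([10000.0, 9000.0, 12000.0, 10500.0, 9200.0, 11800.0])

# The tree does not encode categories internally, so do it beforehand.
encoder = OneHotEncoder()
X = encoder.fit_transform(cars).toarray()  # one 0/1 column per car brand
feature_names = [f"car_{name}" for name in encoder.categories_[0]]

tree = DecisionTreeRegressor(random_state=0).fit(X, income)

# Internal (non-leaf) nodes are those that have a left child (leaves use -1).
is_split = tree.tree_.children_left != -1
split_thresholds = tree.tree_.threshold[is_split]
split_features = [feature_names[i] for i in tree.tree_.feature[is_split]]

# Each split is on a 0/1 column, so the threshold sits halfway, at 0.5.
# Samples with value <= threshold go to the left child.
print(list(zip(split_features, split_thresholds)))
print(export_text(tree, feature_names=feature_names))
```

Since every encoded column only takes the values 0 and 1, a threshold of 0.5 separates "is this brand" from "is not this brand", which is the categorical behaviour one would want, regardless of whether the comparison is strict or not.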