On a separate note, what do you use for plotting? I found graphviz, but you have to first save the plot as a PNG on your computer. That's a lot of work for just one plot. Is there something like matplotlib?

Thanks!
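For what it's worth, since scikit-learn 0.21 there is sklearn.tree.plot_tree, which draws a fitted tree directly into a matplotlib figure, so no intermediate graphviz file is needed. A minimal sketch (the iris data below is just a stand-in):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

    plot_tree(clf, filled=True)  # renders onto the current matplotlib axes
    plt.show()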
On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:

> Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird) -- something like that wouldn't be mentioned in textbooks.
>
> Best,
> Sebastian
>
> > On Oct 4, 2019, at 6:33 PM, C W <tmrs...@gmail.com> wrote:
> >
> > Thanks Sebastian, I think I get it.
> >
> > It's just that I have never seen it this way. Quite different from what I'm used to in Elements of Statistical Learning.
> >
> > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
> > Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do something like
> >
> >     if x == 1 then right child node
> >     else left child node
> >
> > Instead, what it does is
> >
> >     if x >= 0.5 then right child node
> >     else left child node
> >
> > These are basically equivalent, as you can see when you just plug in the values 0 and 1 for x.
> >
> > Best,
> > Sebastian
> >
> > > On Oct 4, 2019, at 5:34 PM, C W <tmrs...@gmail.com> wrote:
> > >
> > > I don't understand your answer.
> > >
> > > Why does it still output "greater than 0.5" or "less than" after one-hot encoding? Does the sklearn website have a working example on categorical input?
> > >
> > > Thanks!
> > >
> > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
> > > Like Nicolas said, the 0.5 is just a workaround, but it will do the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is use the following conversion:
> > >
> > >     treat as car_Audi=1 if car_Audi >= 0.5
> > >     treat as car_Audi=0 if car_Audi < 0.5
> > >
> > > or, it may be
> > >
> > >     treat as car_Audi=1 if car_Audi > 0.5
> > >     treat as car_Audi=0 if car_Audi <= 0.5
> > >
> > > (I forgot which one sklearn is using, but either way, it will be fine.)
> > >
> > > Best,
> > > Sebastian
> > >
> > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <nio...@gmail.com> wrote:
> > >>
> > >>> But, the decision tree is still mistaking one-hot encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
> > >>
> > >> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical.
> > >>
> > >> This is why we do one-hot encoding: so that a set of numerical (one-hot encoded) features can be treated as if they were just one categorical feature.
> > >>
> > >> Nicolas
> > >>
> > >> On 10/4/19 2:01 PM, C W wrote:
> > >>> Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. So, a typo on my part.
> > >>>
> > >>> Looks like I did the one-hot encoding correctly. My new variable names are: car_Audi, car_BMW, etc.
> > >>>
> > >>> But, the decision tree is still mistaking one-hot encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
> > >>>
> > >>> Is there a good toy example on the sklearn website? I only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
> > >>>
> > >>> Thanks!
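Sebastian's explanation above is easy to verify by inspecting a fitted tree's split thresholds directly. A minimal sketch, with a single 0/1 column standing in for a one-hot feature and an invented numeric target (none of this is the thread's actual data):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # One 0/1 column playing the role of car_Audi, plus a made-up target.
    X = np.array([[0], [0], [1], [1]])
    y = np.array([10.0, 12.0, 30.0, 32.0])

    tree = DecisionTreeRegressor(max_depth=1).fit(X, y)

    # sklearn places the threshold midway between the observed feature
    # values, i.e. at 0.5, and sends samples with x <= 0.5 to the left child.
    print(tree.tree_.feature[0])    # 0 (the only feature)
    print(tree.tree_.threshold[0])  # 0.5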
> > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
> > >>> Hi,
> > >>>
> > >>>> The funny part is: the tree is taking one-hot encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
> > >>>
> > >>> That's not a one-hot encoding, then.
> > >>>
> > >>> For an Audi data point, it should be
> > >>>
> > >>>     BMW=0
> > >>>     Toyota=0
> > >>>     Audi=1
> > >>>
> > >>> for BMW
> > >>>
> > >>>     BMW=1
> > >>>     Toyota=0
> > >>>     Audi=0
> > >>>
> > >>> and for Toyota
> > >>>
> > >>>     BMW=0
> > >>>     Toyota=1
> > >>>     Audi=0
> > >>>
> > >>> The split threshold should then be at 0.5 for any of these features.
> > >>>
> > >>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.
> > >>>
> > >>> Best,
> > >>> Sebastian
> > >>>
> > >>>> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
> > >>>>
> > >>>> I'm getting some funny results. I am doing a regression decision tree; the response variables are assigned to levels.
> > >>>>
> > >>>> The funny part is: the tree is taking one-hot encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories.
> > >>>>
> > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot encoding wrong? How does sklearn know internally that 0 vs. 1 is categorical, not numerical?
> > >>>>
> > >>>> In R, for instance, you do as.factor(), which explicitly states the data type.
> > >>>>
> > >>>> Thank you!
> > >>>>
> > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
> > >>>>
> > >>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
> > >>>>>
> > >>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
> > >>>>> Thanks, Guillaume.
> > >>>>> Column transformer looks pretty neat. I've also heard, though, that this pipeline can be tedious to set up? Specifying what you want for every feature is a pain.
> > >>>>>
> > >>>>> It would be interesting for us to know which part of the pipeline is tedious to set up, so we can see whether we can improve something there.
> > >>>>> Do you mean that you would like to automatically detect the type of each feature (categorical/numerical) and apply a default encoder/scaling, such as discussed there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
> > >>>>>
> > >>>>> IMO, from a user perspective, it would be cleaner in some cases, at the cost of blindly applying a black box, which might be dangerous.
> > >>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
> > >>>> which basically does that.
> > >>>>
> > >>>>> Javier,
> > >>>>> Actually, you guessed right. My real data has only one numerical variable; it looks more like this:
> > >>>>>
> > >>>>>     Gender   Date       Income   Car      Attendance
> > >>>>>     Male     2019/3/01  10000    BMW      Yes
> > >>>>>     Female   2019/5/02  9000     Toyota   No
> > >>>>>     Male     2019/7/15  12000    Audi     Yes
> > >>>>>
> > >>>>> I am predicting income using all other categorical variables. Maybe it is catboost!
> > >>>>>
> > >>>>> Thanks,
> > >>>>> M
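For a frame like the one above, the ColumnTransformer setup Guillaume mentions is only a few lines. A rough sketch on the thread's toy columns (Date is left out for simplicity; OneHotEncoder plays the role of R's as.factor() plus the dummy-column expansion):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.tree import DecisionTreeRegressor

    # The thread's toy data (Date omitted for simplicity).
    df = pd.DataFrame({
        "Gender":     ["Male", "Female", "Male"],
        "Car":        ["BMW", "Toyota", "Audi"],
        "Attendance": ["Yes", "No", "Yes"],
        "Income":     [10000, 9000, 12000],
    })
    categorical = ["Gender", "Car", "Attendance"]

    # One-hot encode the categorical columns; any remaining numerical
    # columns would be passed through unchanged.
    preprocess = ColumnTransformer(
        [("onehot", OneHotEncoder(), categorical)],
        remainder="passthrough",
    )

    model = make_pipeline(preprocess, DecisionTreeRegressor())
    model.fit(df[categorical], df["Income"])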
> > >>>>>
> > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
> > >>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look at catboost [1], which has an sklearn-compatible API.
> > >>>>>
> > >>>>> J
> > >>>>>
> > >>>>> [1] https://catboost.ai/
> > >>>>>
> > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
> > >>>>> Hello all,
> > >>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees).
> > >>>>>
> > >>>>> For example,
> > >>>>>
> > >>>>>     Gender   Age   Income   Car      Attendance
> > >>>>>     Male     30    10000    BMW      Yes
> > >>>>>     Female   35    9000     Toyota   No
> > >>>>>     Male     50    12000    Audi     Yes
> > >>>>>
> > >>>>> According to the documentation (https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart), it cannot!
> > >>>>>
> > >>>>> It says: "scikit-learn implementation does not support categorical variables for now".
> > >>>>>
> > >>>>> Is this true? If not, can someone point me to an example? If yes, what do people do?
> > >>>>>
> > >>>>> Thank you very much!
> > >>>>>
> > >>>>> --
> > >>>>> Guillaume Lemaitre
> > >>>>> INRIA Saclay - Parietal team
> > >>>>> Center for Data Science Paris-Saclay
> > >>>>> https://glemaitre.github.io/
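Picking up Javier's catboost suggestion: its sklearn-style estimators accept the categorical columns as-is via a cat_features argument, so no one-hot step is needed. A rough sketch on the thread's toy data, assuming catboost is installed (the hyperparameter values are arbitrary placeholders):

    import pandas as pd
    from catboost import CatBoostRegressor  # pip install catboost

    df = pd.DataFrame({
        "Gender":     ["Male", "Female", "Male"],
        "Car":        ["BMW", "Toyota", "Audi"],
        "Attendance": ["Yes", "No", "Yes"],
        "Income":     [10000, 9000, 12000],
    })
    X = df[["Gender", "Car", "Attendance"]]
    y = df["Income"]

    # cat_features names the categorical columns so catboost handles
    # them natively instead of treating them as numbers.
    model = CatBoostRegressor(iterations=50, verbose=0)
    model.fit(X, y, cat_features=list(X.columns))
    print(model.predict(X))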
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn