That's an excellent discussion! I've always wondered how other tools like R handle naturally categorical variables. LightGBM has a scikit-learn API that handles categorical features when you pass their column names (or indices):

```
import lightgbm

lgb = lightgbm.LGBMClassifier()
lgb.fit(X, y, feature_name=..., categorical_feature=...)
```
Where:

- feature_name (list of strings or 'auto', optional (default='auto')) – Feature names. If 'auto' and the data is a pandas DataFrame, the data column names are used.
- categorical_feature (list of strings or int, or 'auto', optional (default='auto')) – Categorical features. If a list of int, interpreted as indices. If a list of strings, interpreted as feature names (you need to specify feature_name as well). If 'auto' and the data is a pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than the int32 max value (2147483647).

As a suggestion, scikit-learn could add a `categorical_feature` parameter to its tree-based estimators so that they work the same way. (I've appended a minimal usage sketch at the end of this message, below the quoted thread.)

On Fri, May 1, 2020 at 12:54 PM C W <tmrs...@gmail.com> wrote:

> Thank you for the link, Guillaume. In my particular case, I am working on
> random forest classification.
>
> The notebook seems great. I will have to go through it in detail. I'm
> still fairly new at using sklearn.
>
> Thank you for everyone's quick response, always feeling loved on here! :)
>
> On Fri, May 1, 2020 at 4:00 AM Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:
>
>> OrdinalEncoder is the equivalent of pd.factorize and will work in the
>> scikit-learn ecosystem.
>>
>> However, be aware that you should not just swap OneHotEncoder for
>> OrdinalEncoder at will. It depends on your machine learning pipeline.
>>
>> As mentioned by Gael, tree-based algorithms will be fine with
>> OrdinalEncoder. If you have a linear model, then you need to use the
>> OneHotEncoder if the categories do not have any order.
>>
>> I will just refer to one notebook that we taught at EuroSciPy last year:
>>
>> https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/rendered_notebooks/02_basic_preprocessing.ipynb
>>
>> On Fri, 1 May 2020 at 05:11, C W <tmrs...@gmail.com> wrote:
>>
>>> Hermes,
>>>
>>> That's an interesting function. Does it work with sklearn after
>>> factorize? Is there any example? Thanks!
>>>
>>> On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales <paisanoher...@hotmail.com> wrote:
>>>
>>>> Perhaps pd.factorize could help?
>>>>
>>>> Get Outlook for Android <https://aka.ms/ghei36>
>>>>
>>>> ------------------------------
>>>> *From:* scikit-learn <scikit-learn-bounces+paisanohermes=hotmail....@python.org> on behalf of Gael Varoquaux <gael.varoqu...@normalesup.org>
>>>> *Sent:* Thursday, April 30, 2020 5:12:06 PM
>>>> *To:* Scikit-learn mailing list <scikit-learn@python.org>
>>>> *Subject:* Re: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type?
>>>>
>>>> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
>>>> > I've used R and Stata; neither needs such a transformation. They have a
>>>> > data type called "factors", which is different from "numeric".
>>>>
>>>> > My problem with OHE:
>>>> > One-hot-encoding results in a large number of features. This really blows up
>>>> > quickly, and I have to fight the curse of dimensionality with PCA reduction. That's
>>>> > not cool!
>>>>
>>>> Most statistical models still do one-hot encoding under the hood, so R
>>>> and Stata do it too.
>>>>
>>>> Typically, tree-based models can be adapted to work directly on
>>>> categorical data. Ours don't. It's work in progress.
>>>>
>>>> G
>>
>> --
>> Guillaume Lemaitre
>> Scikit-learn @ Inria Foundation
>> https://glemaitre.github.io/
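As promised above, here is a minimal usage sketch of the LightGBM API I was describing. The DataFrame, column names, and labels are invented for illustration; with the default categorical_feature='auto', pandas columns of dtype 'category' are picked up automatically, or the categorical columns can be named explicitly in fit, as documented above.

```
import lightgbm
import pandas as pd

# Toy data, invented for illustration: one numeric and one categorical column.
X = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": pd.Categorical(["paris", "tokyo", "paris", "nyc", "tokyo", "nyc"]),
})
y = [0, 1, 0, 1, 1, 0]

clf = lightgbm.LGBMClassifier()

# Option 1: rely on the default categorical_feature='auto', which picks up
# pandas 'category' columns such as "city".
clf.fit(X, y)

# Option 2: name the categorical columns explicitly.
clf.fit(X, y, feature_name=["age", "city"], categorical_feature=["city"])

print(clf.predict(X.head(2)))
```

Either way, no one-hot encoding is needed; LightGBM splits directly on the raw categories.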
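And for completeness, since the quoted thread mentions random forest classification: the OrdinalEncoder route Guillaume describes would look roughly like the sketch below. The column names and data are made up.

```
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up toy data for illustration.
X = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["paris", "tokyo", "paris", "nyc", "tokyo", "nyc"],
})
y = [0, 1, 0, 1, 1, 0]

# Integer-encode the categorical column and pass the numeric column through.
# Trees are fine with arbitrary integer codes; a linear model would not be
# if the categories have no natural order.
preprocess = ColumnTransformer(
    [("categorical", OrdinalEncoder(), ["city"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, RandomForestClassifier(random_state=0))
model.fit(X, y)
print(model.predict(X.head(2)))
```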