> Traditionally, tree-based methods are very good when it comes to categorical 
> variables and can handle them appropriately. There is a current WIP PR to add 
> this support to sklearn.

I think it's also important to distinguish between nominal and ordinal 
features; it can make a huge difference imho. E.g., treating ordinal variables 
like continuous variables probably makes more sense than one-hot encoding them, 
since one-hot encoding throws away the order. Looking forward to the PR :)
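
To make the distinction concrete, here is a minimal pandas sketch. The column 
names, values, and the small < medium < large ordering are made up for 
illustration (they are not from Raga's data), and this is just one way to do 
it, not a recommendation from the sklearn docs:

```python
import pandas as pd

# Hypothetical toy data: "country" is nominal, "size" is ordinal.
df = pd.DataFrame({
    "country": ["US", "DE", "JP", "DE"],
    "size": ["small", "large", "medium", "small"],
})

# Ordinal feature: state the order explicitly and map it to integer codes,
# so that small -> 0, medium -> 1, large -> 2.
size_levels = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(
    df["size"], categories=size_levels, ordered=True
).codes

# Nominal feature: one-hot encode, since there is no meaningful order.
encoded = pd.get_dummies(df[["country", "size_code"]], columns=["country"])
print(encoded)
```

Note that the integer codes assume the levels are roughly evenly spaced; if 
that is a bad assumption for your data, one-hot encoding may still be the 
safer choice.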

> On Jul 21, 2017, at 2:52 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> 
> Just to throw some additional ideas in here. Based on a conversation with a 
> colleague some time ago, I think learning classifier systems 
> (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly 
> useful when working with large, sparse binary vectors (like from a one-hot 
> encoding). I am really not into LCS's, and only know the basics (read through 
> the first chapters of the Intro to Learning Classifier Systems draft; the 
> print version will be out later this year). 
> Also, I once saw an interesting poster on a Set Covering Machine algorithm, 
> which they benchmarked against SVMs, random forests and the like on 
> categorical (genomics) data. Looked promising.
> 
> Best,
> Sebastian
> 
> 
>> On Jul 21, 2017, at 2:37 PM, Raga Markely <raga.mark...@gmail.com> wrote:
>> 
>> Thank you, Jacob. Appreciate it.
>> 
>> Regarding 'perform better', I was referring to better accuracy, precision, 
>> recall, F1 score, etc.
>> 
>> Thanks,
>> Raga
>> 
>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber <jmschreibe...@gmail.com> 
>> wrote:
>> Traditionally, tree-based methods are very good when it comes to categorical 
>> variables and can handle them appropriately. There is a current WIP PR to 
>> add this support to sklearn. I'm not exactly sure what you mean by 
>> "perform better", though. Estimators that ignore the categorical aspect of 
>> these variables and treat them as discrete will likely perform worse than 
>> those that treat them appropriately.
>> 
>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely <raga.mark...@gmail.com> wrote:
>> Hello,
>> 
>> I am wondering if there are some classifiers that perform better for 
>> datasets with categorical features (converted into sparse input matrix with 
>> pd.get_dummies())? The data for the categorical features are nominal (order 
>> doesn't matter, e.g. country, occupation, etc).
>> 
>> If you could provide me some references (papers, books, website, etc), that 
>> would be great.
>> 
>> Thank you very much!
>> Raga
>> 
>> 
>> 
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

