Perhaps pd.factorize could help?
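For illustration, here is a minimal sketch of what that could look like (the column name and values are made up): pd.factorize maps each category to an integer code and returns the uniques, much in the spirit of an R "factor", and scikit-learn's OrdinalEncoder does the equivalent inside a pipeline.

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    # Toy data purely for illustration -- column name and values are invented.
    df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

    # pd.factorize returns integer codes plus the array of unique categories,
    # similar in spirit to an R "factor".
    codes, uniques = pd.factorize(df["color"])
    print(codes)    # e.g. [0 1 2 1 0]
    print(uniques)  # Index(['red', 'green', 'blue'], dtype='object')

    # The scikit-learn counterpart is OrdinalEncoder, which keeps a single
    # integer column per feature instead of one-hot expanding it.
    enc = OrdinalEncoder()
    print(enc.fit_transform(df[["color"]]))

Whether an estimator can make meaningful use of those integer codes is a separate question, as Gael notes below for tree-based models.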
________________________________
From: scikit-learn <[email protected]> on behalf of Gael Varoquaux <[email protected]>
Sent: Thursday, April 30, 2020 5:12:06 PM
To: Scikit-learn mailing list <[email protected]>
Subject: Re: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type?

On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
> I've used R and Stata software, none needs such transformation. They have a
> data type called "factors", which is different from "numeric".

> My problem with OHE:
> One-hot-encoding results in a large number of features. This really blows up
> quickly. And I have to fight the curse of dimensionality with PCA reduction.
> That's not cool!

Most statistical models still do one-hot encoding under the hood. So, R and Stata do it too.

Typically, tree-based models can be adapted to work directly on categorical data. Ours don't. It's work in progress.

G
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
