Have you had a look at the feature-engine package? It has its own imputers and 
encoders that let you select the columns to transform and that return a 
dataframe. It also has an sklearn wrapper that wraps sklearn transformers so that 
they return a dataframe instead of a numpy array.

Cheers.

Sole

Sent from ProtonMail mobile

-------- Original Message --------
On 18 Aug 2020, 13:56, Ram Rachum wrote:

> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <ke...@dataschool.io> wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>
> Thank you for the detailed answers.
>
>>> The task was to remove these irregularities. So for the "?" items, replace 
>>> them with mean, and for the "one", "two" etc. replace with a numerical 
>>> value.
>>
>> If your primary task is "data cleaning", then pandas is usually the optimal 
>> tool. If "preprocessing your data for Machine Learning" is your primary 
>> task, then scikit-learn is usually the optimal tool. There is some overlap 
>> between what is considered "cleaning" and "preprocessing", but I mention 
>> this distinction because it can help you decide what tool to use.
>
> Okay, but here's one example where it gets tricky. For a column with numbers 
> written like "one", "two" and missing values "?", I had to do two things: 
> Change them to numbers (1, 2), and then, instead of the missing values, add 
> the most common element, or mean or whatever. When I tried to use 
> LabelEncoder to do the first part, it complained about the missing values. I 
> couldn't fix these missing values until the labels were changed to ints. So 
> that put me in a frustrating Catch-22 situation, and all the while I'm 
> thinking "It would be so much simpler to just write my own logic in a 
> for-loop rather than try to get Pandas and scikit-learn working together."
>
> Any insights about that?
>
>>> For one, I couldn't figure out how to apply SimpleImputer on just one 
>>> column in the DataFrame, and then get the results in the form of a 
>>> dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional 
>> input. In your case, this would be a 1-column DataFrame (such as 
>> df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy 
>> array. If you need the output to be a DataFrame, one option is to convert 
>> the array to a pandas object and concatenate it to the original DataFrame.
>
> Well, I did do that in the `process_column` helper function in the code I 
> linked to above. But it kind of felt like... What am I using a framework for 
> to begin with? Because that kind of logistics is the reason I want to use a 
> framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.
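
One possible way around the Catch-22 described in the quoted thread is to do the word-to-number step in pandas, so that "?" turns into NaN before any scikit-learn transformer sees it, and only then impute. The sketch below is a minimal illustration under stated assumptions, not the code from the linked script: the column name `doors`, the sample values, and the `word_to_number` mapping are all made up for the example.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: numbers spelled out as words, with "?" marking missing values.
df = pd.DataFrame({"doors": ["two", "four", "?", "two", "four", "four"]})

# Step 1 (pandas): map words to numbers. Values absent from the mapping,
# including "?", become NaN, so no encoder has to deal with the "?" marker.
word_to_number = {"one": 1, "two": 2, "three": 3, "four": 4}
df["doors"] = df["doors"].map(word_to_number)

# Step 2 (scikit-learn): impute the NaNs. Double brackets select a 1-column
# DataFrame, which gives SimpleImputer the 2-dimensional input it expects.
imputer = SimpleImputer(strategy="most_frequent")
imputed = imputer.fit_transform(df[["doors"]])

# Step 3: SimpleImputer returns a NumPy array of shape (n_rows, 1);
# flatten it and write it back into the original column.
df["doors"] = imputed.ravel()

print(df)
```

Swapping `strategy="most_frequent"` for `"mean"` or `"median"` changes the fill value without changing the rest of the flow.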
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
