On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <ke...@dataschool.io> wrote:

> Hi Ram,
>
> These are great questions!
>

Thank you for the detailed answers.

>
> > The task was to remove these irregularities. So for the "?" items,
> > replace them with mean, and for the "one", "two" etc. replace with a
> > numerical value.
>
> If your primary task is "data cleaning", then pandas is usually the
> optimal tool. If "preprocessing your data for Machine Learning" is your
> primary task, then scikit-learn is usually the optimal tool. There is some
> overlap between what is considered "cleaning" and "preprocessing", but I
> mention this distinction because it can help you decide what tool to use.
>

Okay, but here's one example where it gets tricky. For a column with
numbers written as words ("one", "two") and missing values marked "?", I
had to do two things: change the words to numbers (1, 2), and then replace
the missing values with the most common element, the mean, or whatever.
When I tried to use LabelEncoder for the first part, it complained about
the missing values. I couldn't fix the missing values until the labels
were changed to ints. So that put me in a frustrating Catch-22 situation,
and all the while I'm thinking, "It would be so much simpler to just write
my own logic in a for-loop rather than try to get pandas and scikit-learn
working together."

Any insights about that?
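
For illustration, one possible way around the Catch-22 is to skip
LabelEncoder and do the word-to-number step in pandas, so that "?" turns
into NaN at the same time. This is only a sketch: the column name `doors`,
the sample values, and the `word_to_num` mapping are assumptions, not taken
from the original data.

```python
import pandas as pd

# Hypothetical column mixing number words with "?" for missing values.
df = pd.DataFrame({"doors": ["two", "four", "?", "two", "four", "two"]})

# Assumed mapping; extend it to cover the words in the real data.
word_to_num = {"one": 1, "two": 2, "three": 3, "four": 4}

# Step 1: convert words to numbers. Series.map leaves any value that is
# not a key in the mapping (here "?") as NaN, so no error is raised.
doors = df["doors"].map(word_to_num)

# Step 2: fill the NaN values with the most common value.
# Swap in doors.mean() to impute with the mean instead.
df["doors"] = doors.fillna(doors.mode()[0])

print(df["doors"].tolist())
```

Because the mapping step produces NaN for "?", the imputation step no
longer depends on the labels having been encoded first.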


> > For one, I couldn't figure out how to apply SimpleImputer on just one
> > column in the DataFrame, and then get the results in the form of a
> > dataframe.
>
> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional
> input. In your case, this would be a 1-column DataFrame (such as
> df[['col']]) rather than a Series (such as df['col']).
>
> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy
> array. If you need the output to be a DataFrame, one option is to convert
> the array to a pandas object and concatenate it to the original DataFrame.
>
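
A minimal sketch of the approach described above, assuming a hypothetical
DataFrame with a numeric column named `col` that contains one missing
value. Here the imputed array is written back into the same column, which
is one alternative to concatenating a new pandas object.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: "col" has a missing value, "other" is left untouched.
df = pd.DataFrame({"col": [1.0, np.nan, 3.0], "other": ["a", "b", "c"]})

imputer = SimpleImputer(strategy="mean")

# df[["col"]] is a 1-column DataFrame (2-dimensional), which is what the
# transformer expects; df["col"] would be a 1-dimensional Series.
imputed = imputer.fit_transform(df[["col"]])

# The result is a NumPy array of shape (n_rows, 1). Take its only column
# and assign it back so the result stays in the DataFrame.
df["col"] = imputed[:, 0]

print(df)
```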

Well, I did do that in the `process_column` helper function in the code I
linked to above. But it kind of felt like... What am I using a framework
for to begin with? Because that kind of logistics is the reason I want to
use a framework instead of managing my own arrays and imputing logic.

Thanks for your help Kevin.
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
