On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <ke...@dataschool.io> wrote:
> Hi Ram,
>
> > These are great questions!
>
> Thank you for the detailed answers.
>
> > The task was to remove these irregularities. So for the "?" items,
> > replace them with mean, and for the "one", "two" etc. replace with a
> > numerical value.
>
> If your primary task is "data cleaning", then pandas is usually the
> optimal tool. If "preprocessing your data for Machine Learning" is your
> primary task, then scikit-learn is usually the optimal tool. There is some
> overlap between what is considered "cleaning" and "preprocessing", but I
> mention this distinction because it can help you decide what tool to use.

Okay, but here's one example where it gets tricky. For a column with numbers written out as "one", "two" and missing values marked "?", I had to do two things: convert the words to numbers (1, 2), and then replace the missing values with the most common element, or the mean, or whatever. When I tried to use LabelEncoder for the first part, it complained about the missing values. But I couldn't fix the missing values until the labels were changed to ints. That put me in a frustrating Catch-22, and all the while I was thinking, "It would be so much simpler to just write my own logic in a for-loop rather than try to get pandas and scikit-learn working together." Any insights about that?

> > For one, I couldn't figure out how to apply SimpleImputer on just one
> > column in the DataFrame, and then get the results in the form of a
> > dataframe.
>
> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional
> input. In your case, this would be a 1-column DataFrame (such as
> df[['col']]) rather than a Series (such as df['col']).
>
> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy
> array. If you need the output to be a DataFrame, one option is to convert
> the array to a pandas object and concatenate it to the original DataFrame.

Well, I did do that in the `process_column` helper function in the code I linked to above.
But it kind of felt like... what am I using a framework for to begin with? Handling that kind of logistics is exactly why I want to use a framework instead of managing my own arrays and imputation logic.

Thanks for your help, Kevin.
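
For reference, the two steps discussed above (words to numbers, then imputing the "?" entries) could be sketched roughly as follows. This is only an illustration under assumptions: the column name `doors`, the sample values, and the `word_to_num` mapping are made up for the example and are not taken from the linked code. It uses pandas `Series.map` for the cleaning step in place of LabelEncoder, then passes a 1-column DataFrame to SimpleImputer as described in the quoted reply.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: number words mixed with "?" marking missing values.
df = pd.DataFrame({"doors": ["two", "four", "?", "two", "four", "four"]})

# Step 1 (pandas): map words to numbers. Anything not in the mapping,
# including "?", becomes NaN, so the missing values need no separate pass first.
word_to_num = {"one": 1, "two": 2, "three": 3, "four": 4}
df["doors"] = df["doors"].map(word_to_num)

# Step 2 (scikit-learn): SimpleImputer wants 2-D input, so pass df[["doors"]]
# (a 1-column DataFrame) rather than df["doors"] (a Series).
imputer = SimpleImputer(strategy="most_frequent")
imputed = imputer.fit_transform(df[["doors"]])  # NumPy array of shape (6, 1)

# Flatten the array and assign it back so the result stays in the DataFrame.
df["doors"] = imputed.ravel()

print(df["doors"].tolist())  # [2.0, 4.0, 4.0, 2.0, 4.0, 4.0]
```

Doing the word-to-number conversion in pandas first sidesteps the Catch-22, because the "?" entries turn into NaN as a side effect of the mapping and the imputer can then handle them directly.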
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn