I'll check it out. Thank you.

On Wed, Aug 19, 2020 at 9:46 AM Sole Galli via scikit-learn <scikit-learn@python.org> wrote:
> Did you have a look at the package feature-engine? It has its own imputers
> and encoders that allow you to select the columns to transform and return
> a dataframe. It also has a sklearn wrapper that wraps sklearn transformers
> so that they return a dataframe instead of a numpy array.
>
> Cheers.
>
> Sole
>
> Sent from ProtonMail mobile
>
> -------- Original Message --------
> On 18 Aug 2020, 13:56, Ram Rachum < r...@rachum.com> wrote:
>
> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <ke...@dataschool.io> wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>
> Thank you for the detailed answers.
>
>> > The task was to remove these irregularities. So for the "?" items,
>> > replace them with mean, and for the "one", "two" etc. replace with a
>> > numerical value.
>>
>> If your primary task is "data cleaning", then pandas is usually the
>> optimal tool. If "preprocessing your data for Machine Learning" is your
>> primary task, then scikit-learn is usually the optimal tool. There is some
>> overlap between what is considered "cleaning" and "preprocessing", but I
>> mention this distinction because it can help you decide what tool to use.
>
> Okay, but here's one example where it gets tricky. For a column with
> numbers written like "one", "two" and missing values "?", I had to do two
> things: Change them to numbers (1, 2), and then, instead of the missing
> values, add the most common element, or mean or whatever. When I tried to
> use LabelEncoder to do the first part, it complained about the missing
> values. I couldn't fix these missing values until the labels were changed
> to ints. So that put me in a frustrating Catch-22 situation, and all the
> while I'm thinking "It would be so much simpler to just write my own logic
> in a for-loop rather than try to get Pandas and scikit-learn working
> together."
>
> Any insights about that?
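On the Catch-22 described above: one possible way around it, sketched here under assumptions, is to do the word-to-number step in pandas rather than with LabelEncoder. Series.map with a dict turns every value that is not a key (including "?") into NaN, so the conversion and the marking of missing values happen in one pass, and the imputation can follow directly. The column name "doors", the sample values, and the word_to_number mapping are all hypothetical; the actual data is not shown in this thread.

```python
import pandas as pd

# Hypothetical data: the column name and values are assumptions for illustration.
df = pd.DataFrame({"doors": ["two", "four", "?", "two", "four", "two"]})

# Hypothetical mapping from number words to integers.
word_to_number = {"one": 1, "two": 2, "three": 3, "four": 4}

# Step 1: convert words to numbers. Anything not in the mapping,
# including the "?" placeholder, becomes NaN.
doors = df["doors"].map(word_to_number)

# Step 2: fill the NaN entries with the most common value (the mode).
# mode() ignores NaN by default; iloc[0] picks the first mode if there are ties.
most_common = doors.mode().iloc[0]
df["doors"] = doors.fillna(most_common).astype(int)

print(df["doors"].tolist())
```

Swapping doors.mode().iloc[0] for doors.mean() would give mean imputation instead; in that case the final astype(int) should be dropped, since the mean need not be a whole number.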
>
>> > For one, I couldn't figure out how to apply SimpleImputer on just one
>> > column in the DataFrame, and then get the results in the form of a
>> > dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional
>> input. In your case, this would be a 1-column DataFrame (such as
>> df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy
>> array. If you need the output to be a DataFrame, one option is to convert
>> the array to a pandas object and concatenate it to the original DataFrame.
>
> Well, I did do that in the `process_column` helper function in the code I
> linked to above. But it kind of felt like... What am I using a framework
> for to begin with? Because that kind of logistics is the reason I want to
> use a framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
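On the SimpleImputer point quoted above (2-dimensional input, NumPy output): a minimal sketch of one way to impute a single column and still end up with a DataFrame is to assign the transformed array back into the same 1-column selection. The column names "price" and "make" and the sample values are hypothetical, and this is not the `process_column` helper mentioned in the thread, which is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: column names and values are assumptions for illustration.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "make": ["a", "b", "c"],
})

imputer = SimpleImputer(strategy="mean")

# df[["price"]] is a 1-column DataFrame (2-D), which is what SimpleImputer expects;
# df["price"] would be a 1-D Series and would be rejected.
imputed = imputer.fit_transform(df[["price"]])

# The result has shape (n_rows, 1). Assigning it back to the same 1-column
# selection keeps everything in the original DataFrame, with no concatenation.
df[["price"]] = imputed

print(df)
```

The other columns are left untouched, so the rest of the pipeline can keep working with df as a DataFrame.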