Hey guys,

This is a bit of a complicated question.
I was helping my friend do a task with Pandas/sklearn for her data science class. I figured it'd be a breeze, since I'm a fancy-pants Python programmer. Oh wow, it was so not. I was trying to do things that felt simple to me, but there were so many problems that after 2 hours I only had a partial solution. I'm wondering whether I'm missing something.

She got a CSV with lots of data about cars. Some of the data had missing values (marked with "?"). Additionally, some columns had small numbers written as strings like "one", "two", "three", etc. There were maybe a few more issues like these. The task was to remove these irregularities: for the "?" items, replace them with the mean, and for "one", "two", etc., replace them with a numerical value.

I could easily write my own logic that does that, but she told me I should use the tools that come with sklearn: SimpleImputer, OneHotEncoder, and BinaryEncoder for the "one", "two", "three". They gave me so, so many problems.

For one, I couldn't figure out how to apply SimpleImputer to just one column in the DataFrame and then get the results in the form of a DataFrame (either changing it in place or creating a new DataFrame). I think I spent an hour on this problem alone. Eventually I found a way <https://www.dropbox.com/preview/Desktop/Shani/floof.py>, but it definitely felt like I was doing something wrong, like this is supposed to be simpler.

Also, when I tried to use BinaryEncoder for "one", "two", "three", it raised an exception because there were NaN values there. I wanted to first convert them to real numbers and then use the same SimpleImputer to fix the NaN values, but I couldn't, because of the exception.

Any insight you could give me would be useful.

Thanks,
Ram.
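
P.S. To make the task concrete, here is a minimal sketch of the kind of cleanup described above. It is not the code from floof.py. The column names (`horsepower`, `num_doors`) and the data are made up for illustration, and it assumes a plain dict is an acceptable stand-in for the encoder when turning "one", "two", "three" into numbers. The idea is to let pandas mark "?" as missing while reading, map the number words, and only then run SimpleImputer on the selected columns:

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in for the cars CSV; the column names are hypothetical.
csv_text = """horsepower,num_doors
111,two
?,four
154,?
102,four
"""

# Treat "?" as missing while reading, so numeric columns parse as floats.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Map number words to numbers with a plain dict; missing entries stay NaN.
word_to_number = {"one": 1, "two": 2, "three": 3, "four": 4}
df["num_doors"] = df["num_doors"].map(word_to_number)

# SimpleImputer expects 2-D input, so select with a list of column names
# and assign the resulting array back to those same columns.
cols = ["horsepower", "num_doors"]
imputer = SimpleImputer(strategy="mean")
df[cols] = imputer.fit_transform(df[cols])

print(df)
```

To impute a single column the same way, a one-element list such as `df[["horsepower"]]` keeps the input 2-D. Whether the mean is a sensible fill value for something like a door count is a separate question; `strategy="most_frequent"` is another option SimpleImputer offers.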
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn