I will need to look into factorize. Here is the result from profiling the transform method on a single new observation: https://codereview.stackexchange.com/q/171622/132999
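For reference, a minimal sketch (not from the thread itself) of what pandas.factorize returns: an array of integer codes plus the unique values, with codes assigned in order of first appearance:

```python
import pandas as pd

# factorize maps each value to an integer code, assigned in order of
# first appearance, and returns the unique values alongside the codes.
codes, uniques = pd.factorize(["medium", "small", "medium", "large", "small"])

print(codes.tolist())    # integer code per position
print(list(uniques))     # unique values in first-appearance order
```

Note that the codes reflect appearance order, not any domain ordering, so for ordinal variables an explicit mapping is still needed.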
Best,
Georg

Sebastian Raschka <se.rasc...@gmail.com> wrote on Sun, 6 Aug 2017 at 20:39:

> > performance of prediction is pretty lame when there are around 100-150
> > columns used as the input.
>
> You are talking about computational performance when you are calling the
> "transform" method? Have you done some profiling to find out where your
> bottleneck (in the for loop) is? Just on a very quick look, I think this
>
>     data.loc[~data[column].isin(fittedLabels), column] = str(replacementForUnseen)
>
> is already very slow, because fittedLabels is an array where you have O(n)
> lookup instead of an average O(1) by using a hash table. Or is the isin
> function converting it to a hash table/set/dict?
>
> In general, would it maybe help to use pandas' factorize?
> https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html
> For predict time, say you have only 1 example for prediction that needs to
> be converted: you could append prototypes of all possible values that could
> occur, do the transformation, and then only pass the 1 transformed sample
> to the classifier. I guess that could even be slow though ...
>
> Best,
> Sebastian
>
> > On Aug 6, 2017, at 6:30 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
> >
> > @sebastian: thanks. Indeed, I am aware of this problem.
> >
> > I developed something here:
> > https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce
> > but realized that the performance of prediction is pretty lame when there
> > are around 100-150 columns used as the input.
> > Do you have some ideas how to speed this up?
> >
> > Regards,
> > Georg
> >
> > Joel Nothman <joel.noth...@gmail.com> wrote on Sun, 6 Aug 2017 at 00:49:
> > We are working on CategoricalEncoder in
> > https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more
> > with this kind of thing. Feedback and testing is welcome.
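To illustrate the lookup-cost point quoted above, here is a small sketch (with placeholder data, not the code from the gist): Series.isin builds a hash table from its argument internally, so membership tests average O(1) either way, but holding the fitted labels in a Python set rather than an array is still the safe choice for any plain-Python membership checks in the loop:

```python
import pandas as pd

# Placeholder fitted labels and replacement value, mirroring the
# fittedLabels / replacementForUnseen names quoted in the thread.
fitted_labels = {"red", "green", "blue"}
replacement_for_unseen = "UNSEEN"

data = pd.DataFrame({"color": ["red", "purple", "blue", "orange"]})

# Series.isin hashes its argument, so lookups are O(1) on average;
# unseen values are replaced in a single vectorized step.
mask = ~data["color"].isin(fitted_labels)
data.loc[mask, "color"] = replacement_for_unseen

print(data["color"].tolist())
```

The vectorized mask-and-assign avoids a Python-level loop over rows, which is usually the larger win for 100-150 columns.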
> > On 6 August 2017 at 02:13, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > Hi, Georg,
> >
> > I bring this up every time here on the mailing list :), and you are
> > probably aware of this issue, but it makes a difference whether your
> > categorical data is nominal or ordinal. For instance, if you have an
> > ordinal variable with values like {small, medium, large}, you probably
> > want to encode it as {1, 2, 3} or {1, 20, 100} or whatever is appropriate
> > based on your domain knowledge regarding the variable. If you have
> > something like {blue, red, green}, it may make more sense to do a one-hot
> > encoding so that the classifier doesn't assume a relationship between the
> > values like blue > red > green or something like that.
> >
> > Now, the DictVectorizer and OneHotEncoder are both doing one-hot
> > encoding. The LabelEncoder does convert a variable to integer values, but
> > if you have something like {small, medium, large}, it wouldn't know the
> > order (if that's an ordinal variable) and it would just assign arbitrary
> > integers in increasing order. Thus, if you are dealing with ordinal
> > variables, there's no way around doing this manually; for example, you
> > could create mapping dictionaries for that (most conveniently done in
> > pandas).
> >
> > Best,
> > Sebastian
> >
> > > On Aug 5, 2017, at 5:10 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > the LabelEncoder is only meant for a single column, i.e. the target
> > > variable. Is the DictVectorizer or a manual chaining of multiple
> > > LabelEncoders (one per categorical column) the desired way to get
> > > values which can be fed into a subsequent classifier?
> > >
> > > Is there some way I have overlooked which works better and possibly
> > > also can handle unseen values by applying most-frequent imputation?
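The two encodings Sebastian distinguishes above can be sketched in pandas like this (the example values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium"],
                   "color": ["blue", "red", "green"]})

# Ordinal variable: encode the order explicitly with a mapping dictionary,
# since LabelEncoder would just assign arbitrary (alphabetical) integers.
size_order = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_order)

# Nominal variable: one-hot encode so the classifier can't read an
# ordering like blue > red > green into the values.
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df["size_encoded"].tolist())
print(sorted(one_hot.columns))
```

The mapping dictionary is exactly the "manual" step Sebastian mentions: the order {1, 2, 3} has to come from domain knowledge, not from the data.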
> > > regards,
> > > Georg
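Georg's question about unseen values with most-frequent imputation could be sketched as follows (a hedged per-column sketch with assumed names, not the gist's actual implementation): remember the seen categories and the mode at fit time, then route anything unseen to the mode at transform time.

```python
import pandas as pd

# Fit: record the categories seen in training and the most frequent one.
train = pd.DataFrame({"color": ["red", "red", "blue", "green"]})
seen = set(train["color"])
most_frequent = train["color"].mode()[0]

# Transform: keep values seen during fit, replace unseen ones with the
# most frequent training value before encoding.
new = pd.Series(["blue", "purple", "red"])
imputed = new.where(new.isin(seen), most_frequent)

print(imputed.tolist())
```

After this step the column contains only known categories, so any downstream encoder fitted on the training data can transform it safely.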
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn