Re: [scikit-learn] transform categorical data to numerical representation

Georg Heiler Sun, 06 Aug 2017 03:33:18 -0700

@sebastian: thanks. Indeed, I am aware of this problem.

I developed something here:
https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but
realized that prediction is quite slow when around 100-150 columns are
used as input.
Do you have any ideas on how to speed this up?


Regards,
Georg

Joel Nothman <[email protected]> wrote on Sun, 6 Aug 2017 at
00:49:

> We are working on CategoricalEncoder in
> https://github.com/scikit-learn/scikit-learn/pull/9151 to give users more
> help with this kind of thing. Feedback and testing are welcome.
>
> On 6 August 2017 at 02:13, Sebastian Raschka <[email protected]> wrote:
>
>> Hi, Georg,
>>
>> I bring this up every time here on the mailing list :), and you are
>> probably aware of this issue, but it makes a difference whether your
>> categorical data is nominal or ordinal. For instance, if you have an
>> ordinal variable with values like {small, medium, large}, you probably
>> want to encode it as {1, 2, 3} or {1, 20, 100} or whatever is appropriate
>> based on your domain knowledge regarding the variable. If you have
>> something like {blue, red, green}, it may make more sense to do a one-hot
>> encoding so that the classifier doesn't assume an ordering among the
>> values like blue > red > green or something like that.
>>
>> Now, the DictVectorizer and OneHotEncoder both do one-hot encoding. The
>> LabelEncoder does convert a variable to integer values, but if you have
>> something like {small, medium, large}, it wouldn't know the order (if
>> that's an ordinal variable); it would just assign integers following the
>> sorted order of the labels, which is arbitrary with respect to the true
>> ordering. Thus, if you are dealing with ordinal variables, there's no way
>> around doing this manually; for example, you could create mapping
>> dictionaries for that (most conveniently done in pandas).
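
A minimal sketch of the manual mapping described above, assuming pandas is
available. The column names, the example values, and the `size_order`
dictionary are illustrative assumptions, not taken from any real dataset;
`pd.get_dummies` is used here as one possible way to one-hot encode the
nominal column.

```python
import pandas as pd

# Toy data (hypothetical): one ordinal column and one nominal column.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["blue", "red", "green", "blue"],
})

# Ordinal column: the order comes from domain knowledge, so spell it out.
# Values missing from the dictionary would become NaN after .map().
size_order = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_order)

# Nominal column: one-hot encode so that no ordering is implied.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded)
```

Other spacings such as {1, 20, 100} only require changing the values in
`size_order`.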
>>
>> Best,
>> Sebastian
>>
>> > On Aug 5, 2017, at 5:10 AM, Georg Heiler <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > the LabelEncoder is only meant for a single column, i.e. the target
>> > variable. Is the DictVectorizer or a manual chaining of multiple
>> > LabelEncoders (one per categorical column) the recommended way to get
>> > values which can be fed into a subsequent classifier?
>> >
>> > Is there some way I have overlooked which works better and possibly
>> > can also handle unseen values by applying most-frequent imputation?
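
To make the question concrete, here is a rough pure-Python sketch of "one
encoder per categorical column, with unseen values replaced by the most
frequent training category". `ColumnLabelEncoder`, `fit_encoders` and
`encode_rows` are hypothetical helper names made up for this illustration;
they are not part of scikit-learn, and this is not the code from the gist.

```python
from collections import Counter


class ColumnLabelEncoder:
    """Hypothetical helper: integer-encode a single categorical column."""

    def fit(self, values):
        counts = Counter(values)
        # Remember the most frequent category as the fallback for unseen values.
        self.most_frequent_ = counts.most_common(1)[0][0]
        # Assign integers by sorted category order (no ordinal meaning implied).
        self.mapping_ = {cat: i for i, cat in enumerate(sorted(counts))}
        return self

    def transform(self, values):
        fallback = self.mapping_[self.most_frequent_]
        # Unseen categories are imputed with the code of the most frequent one.
        return [self.mapping_.get(v, fallback) for v in values]


def fit_encoders(rows, columns):
    """Fit one encoder per column; `rows` is a list of dicts."""
    return {c: ColumnLabelEncoder().fit([r[c] for r in rows]) for c in columns}


def encode_rows(rows, encoders):
    """Encode column by column, then reassemble into rows of integers."""
    encoded_cols = [enc.transform([r[c] for r in rows])
                    for c, enc in encoders.items()]
    return [list(t) for t in zip(*encoded_cols)]


train = [
    {"color": "blue", "size": "small"},
    {"color": "red", "size": "large"},
    {"color": "blue", "size": "medium"},
]
encoders = fit_encoders(train, ["color", "size"])

# "green" was never seen during fit, so it falls back to "blue".
unseen = [{"color": "green", "size": "small"}]
print(encode_rows(unseen, encoders))
```

The integer codes still carry an implied order, so for nominal columns they
would normally be followed by a one-hot encoding step.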
>> >
>> > regards,
>> > Georg
>> > _______________________________________________
>> > scikit-learn mailing list
>> > [email protected]
>> > https://mail.python.org/mailman/listinfo/scikit-learn
