As soon as you use non-binary integers (e.g., {1, 2, 3, 4}, {5, 6, 7, 8}),
you turn the nominal letters into ordinal variables. Assuming that you want
to keep the variables on a nominal scale, I currently can’t think of a good
way around one-hot encoding. Since you can use a sparse matrix
representation, the “explosion” is actually not too bad ;)
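
To illustrate the idea, here is a minimal sketch in plain Python. It assumes the shared alphabet is {'a', ..., 'f'} (the union of the two example sets), and the names ALPHABET, COLUMN, encode_sparse, and to_dense are made up for this example. Each (set, letter) pair gets its own binary column, so order within a set cannot matter, and 'b' in X1 stays a different feature from 'b' in X2. Only the positions of the nonzero entries are stored, which is roughly what a sparse matrix format keeps track of.

```python
# Assumed shared alphabet: the union of the two example sets in the question.
ALPHABET = ["a", "b", "c", "d", "e", "f"]
N_SETS = 2  # X1 and X2

# One binary column per (set index, letter) pair.
# Columns 0-5 belong to X1, columns 6-11 belong to X2.
COLUMN = {
    (s, letter): s * len(ALPHABET) + i
    for s in range(N_SETS)
    for i, letter in enumerate(ALPHABET)
}


def encode_sparse(sets):
    """Return the sorted column indices of the nonzero entries for one sample."""
    return sorted(
        COLUMN[(s, letter)]
        for s, members in enumerate(sets)
        for letter in members
    )


def to_dense(indices, width=N_SETS * len(ALPHABET)):
    """Expand the stored indices into a full 0/1 row (for inspection only)."""
    row = [0] * width
    for j in indices:
        row[j] = 1
    return row


sample = ({"a", "b", "c", "d"}, {"e", "d", "b", "f"})
nonzero = encode_sparse(sample)

print(nonzero)            # indices of the active columns
print(to_dense(nonzero))  # the equivalent dense one-hot row
```

The dense row has 12 entries here, but the sparse form only stores the 8 active indices; with a larger alphabet the dense width grows while the stored indices stay proportional to the number of letters actually present in each sample.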
Best,
Sebastian
> On Aug 14, 2015, at 9:01 AM, federico vaggi <[email protected]> wrote:
>
> Hi,
>
> Simple example:
>
> Let's say that I have a binary classification task, and my input vector
> consists of two disjunct sets of categorical variables - something like:
>
> X1 = {'a', 'b', 'c', 'd'} and X2 = {'e', 'd', 'b', 'f'}
>
> The order within the sets does not matter (obviously), but it matters that
> the elements of X1 are conceptually separate from those of X2.
>
> All the categorical variables come from the same set.
>
> Is there a clever encoding that:
>
> - Emphasizes that order within each set does not matter
> - Avoids explosion with one-hot encoding everything?
>
> Federico
> ------------------------------------------------------------------------------
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general