As soon as you use non-binary integer codes (e.g., {1, 2, 3, 4}, {5, 6, 7, 8}), you’d turn the nominal letters into ordinal variables. Assuming that you want to keep the variables on a nominal scale, I currently can’t think of a good way around one-hot encoding. Since you can use a sparse matrix representation, the “explosion” is actually not too bad ;)
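To make that concrete, here is a minimal pure-Python sketch of what I mean. The names `samples`, `build_vocabulary`, and `encode` are made up for illustration, and the second sample is invented. Each (set name, letter) pair gets its own binary column, so 'b' in X1 stays separate from 'b' in X2 and no order is implied; each row stores only the indices of its nonzero columns, which is the idea behind a sparse representation.

```python
# Sketch only: the data and helper names are illustrative assumptions.
samples = [
    {"X1": {"a", "b", "c", "d"}, "X2": {"e", "d", "b", "f"}},
    {"X1": {"a", "e"}, "X2": {"b", "c"}},  # invented second sample
]


def build_vocabulary(samples):
    """Map every (set name, letter) pair seen in the data to a column index."""
    pairs = sorted({(name, letter)
                    for sample in samples
                    for name, letters in sample.items()
                    for letter in letters})
    return {pair: col for col, pair in enumerate(pairs)}


def encode(sample, vocabulary):
    """Return the sorted indices of the columns that are 1 (all others are 0)."""
    return sorted(vocabulary[(name, letter)]
                  for name, letters in sample.items()
                  for letter in letters)


vocabulary = build_vocabulary(samples)
rows = [encode(sample, vocabulary) for sample in samples]

print(len(vocabulary))  # number of binary columns
print(rows)             # nonzero column indices per sample
```

The number of columns grows with the number of distinct (set, letter) pairs, but the storage per row only grows with the number of letters actually present in that sample. In scikit-learn, I believe DictVectorizer gives you much the same thing with SciPy sparse output if you feed it dicts with keys such as "X1=a", but treat that as a pointer rather than a tested recipe.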
Best,
Sebastian

> On Aug 14, 2015, at 9:01 AM, federico vaggi <vaggi.feder...@gmail.com> wrote:
>
> Hi,
>
> Simple example:
>
> Let's say that I have a binary classification task, and my input vector
> consists of two disjunct sets of categorical variables - something like:
>
> X1 = {'a', 'b', 'c', 'd'} and X2 = {'e', 'd', 'b', 'f'}
>
> The order within the sets does not matter (obviously), but it matters that
> the elements of X1 are conceptually separate from those of X2.
>
> All the categorical variables come from the same set.
>
> Is there a clever encoding that:
>
> - Emphasizes that order within each set does not matter
> - Avoids explosion with one-hot encoding everything?
>
> Federico
> ------------------------------------------------------------------------------
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general