Also, I was just reading the paper you referred to. It makes what seem to me to be a series of somewhat strawman arguments against 1 of n encoding.
First, actual practice often involves Euclidean distances between points on a sphere S^n rather than between unrestricted points in R^n. This helps quite a lot. Another vein of usage is to encode points using 1 of n coding and then embed them based on cooccurrence in a user history matrix. Euclidean distance works well there too. Neither of these approaches is addressed in the justification of your paper.

I haven't read enough or thought enough to talk about your methods yet.

On Sun, Jun 2, 2013 at 3:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> So Florents, can you say how this works better than 1 of n coding and then
> using a simple scaled Euclidean metric?
>
> Beyond that, how would this scale?
>
> On Sun, Jun 2, 2013 at 2:39 PM, Florents Tselai <tse...@dmst.aueb.gr> wrote:
>
>> I've noticed (correct me if I'm wrong) that mahout lacks algorithms
>> specialized in clustering data with categorical attributes.
>>
>> Would the community be interested in the implementation of algorithms like
>> ROCK <http://www.cis.upenn.edu/~sudipto/mypapers/categorical.pdf> ?
>>
>> I'm currently working on this area (applied-research project) and I'd like
>> to have my code open-sourced.
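To make the baseline in the reply above concrete, here is a minimal sketch of 1 of n coding followed by Euclidean distance between points scaled onto the unit sphere. It is only an illustration under stated assumptions: the toy records are made up, and the function names (`one_of_n`, `to_unit_sphere`, `euclidean`) are hypothetical helpers, not Mahout APIs.

```python
import math


def one_of_n(records):
    """Encode categorical records with 1 of n coding.

    Every distinct (attribute position, value) pair gets its own column.
    """
    columns = {}
    for rec in records:
        for attr, value in enumerate(rec):
            columns.setdefault((attr, value), len(columns))
    vectors = []
    for rec in records:
        v = [0.0] * len(columns)
        for attr, value in enumerate(rec):
            v[columns[(attr, value)]] = 1.0
        vectors.append(v)
    return vectors


def to_unit_sphere(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up toy data: three records with three categorical attributes each.
records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
]

points = [to_unit_sphere(v) for v in one_of_n(records)]
d01 = euclidean(points[0], points[1])  # records share two of three values
d02 = euclidean(points[0], points[2])  # records share no values
print(d01, d02)
```

For unit vectors the squared Euclidean distance equals 2 - 2 cos(theta), so on the sphere this metric ranks pairs the same way cosine similarity does. In the toy data, records that share more attribute values end up closer together.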