Re: Algorithms for categorical data

Ted Dunning Sun, 02 Jun 2013 12:34:39 -0700

Also, I was just reading the paper you referred to.  It makes what seems to
me to be a series of somewhat strawman arguments against 1 of n encoding.

First, actual practice often involves Euclidean distances between points on
a sphere S^n rather than unrestricted points in R^n.  This helps quite
a lot.
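
A minimal sketch of that first point, assuming two made-up categorical
attributes (the vocab, records, and helper names below are hypothetical, not
from the paper or from Mahout): encode each record 1 of n, scale it to unit
length so it sits on a sphere, and then take plain Euclidean distance.

```python
import math

def one_hot(record, vocab):
    """Encode a tuple of categorical values as a 1-of-n vector.

    vocab holds one list of allowed values per attribute (hypothetical data).
    """
    vec = []
    for value, allowed in zip(record, vocab):
        vec.extend(1.0 if value == a else 0.0 for a in allowed)
    return vec

def to_unit_sphere(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two attributes: color (3 values) and size (2 values).
vocab = [["red", "green", "blue"], ["small", "large"]]

a = to_unit_sphere(one_hot(("red", "small"), vocab))
b = to_unit_sphere(one_hot(("red", "large"), vocab))
c = to_unit_sphere(one_hot(("blue", "large"), vocab))

d_ab = euclidean(a, b)  # records differ in one attribute
d_ac = euclidean(a, c)  # records differ in both attributes
```

On the unit sphere the squared distance is 2 - 2 cos(theta), so the distance
is bounded and depends only on the fraction of attributes that agree, not on
how many categories each attribute has.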

Another vein of usage is to encode items using 1 of n coding and then to
embed them based on cooccurrence in a user history matrix.  Euclidean
distance works well there too.

Neither of these approaches is addressed in the justification given in the
paper.

I haven't read enough or thought enough to talk about your methods yet.

On Sun, Jun 2, 2013 at 3:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> So Florents, can you say how this works better than 1 of n coding and then
> using a simple scaled Euclidean metric?
>
> Beyond that, how would this scale?
>
>
>
>
> On Sun, Jun 2, 2013 at 2:39 PM, Florents Tselai <tse...@dmst.aueb.gr> wrote:
>
>> I've noticed (correct me if I'm wrong) that mahout lacks algorithms
>> specialized in clustering data with categorical attributes.
>>
>> Would the community be interested in the implementation of algorithms like
>> ROCK <http://www.cis.upenn.edu/~sudipto/mypapers/categorical.pdf> ?
>>
>> I'm currently working on this area (applied-research project) and I'd like
>> to have my code open-sourced.
>>
>
>
