If I may attempt to clarify: indeed, it makes no sense to have a vector whose elements are string-valued, nor can I think of any mapping from strings to doubles that would be useful here.
What he is really after is clustering strings as if they were vectors themselves, not elements of another vector. The question is: how much do we need to be able to treat strings like vectors to make the algorithm work? We need a distance metric, and he's suggesting Levenshtein, which seems OK at first glance. (It satisfies the triangle inequality ... I think?) Centroids are just strings that are a similar number of edits away from another set of strings. Distances are discrete, but does that matter? Is there anything else that doesn't map? I haven't thought about it a lot, but I don't yet see why k-means couldn't let you cluster strings.

In the CF code I do something similar for arbitrary 'items', which hints to me that a well-behaved distance metric is all you need. Of course, the code wouldn't quite work as-is for this; one would probably need to modify it a lot. For what it's worth, you could actually get the TreeClusteringRecommender class to cluster strings with just a little work. I am not sure it implements the algorithm you want, and it is also not distributed.

Sean

On Sep 1, 2009 5:14 PM, "Ted Dunning" <[email protected]> wrote:

That particular trick wouldn't work because you are losing the essence of real numbers with this step. If 1.0 refers to one string and 2.0 refers to another, what does 1.5 refer to?

Better to use trigrams as the labels for the coordinates and weight them by inverse document frequency.

On Tue, Sep 1, 2009 at 6:28 AM, Juan Francisco Contreras Gaitan <[email protected]> wrote:

> ... I could use a Map between doubles and strings: storing doubles in all
> the algorithm, and retrieving the strings to compute distance in the
> measuring steps.
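For concreteness, here is a minimal sketch of the Levenshtein edit distance Sean mentions above, as a standalone Java class. The class name and layout are illustrative, not Mahout's actual API. (And to answer the parenthetical: Levenshtein distance does satisfy the triangle inequality, so it is a genuine metric.)

// Minimal Levenshtein (edit) distance, suitable as a pluggable string
// distance measure. Levenshtein is a true metric: non-negative, symmetric,
// zero only for equal strings, and it satisfies the triangle inequality.
public final class Levenshtein {

  public static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // cost of building b[0..j) from the empty string
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // cost of deleting the first i characters of a
      for (int j = 1; j <= b.length(); j++) {
        int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        int deletion = prev[j] + 1;
        int insertion = curr[j - 1] + 1;
        curr[j] = Math.min(substitution, Math.min(deletion, insertion));
      }
      int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
    }
    return prev[b.length()];
  }

  public static void main(String[] args) {
    System.out.println(distance("kitten", "sitting")); // prints 3
  }
}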

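And a sketch of the trigram-plus-IDF representation Ted suggests: each string becomes a sparse real-valued vector over its character trigrams, at which point ordinary k-means, and meaningful centroids, apply directly. Class and method names here are illustrative, not an existing Mahout vectorizer.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: map each string to a sparse vector whose coordinates are character
// trigrams, weighted by term frequency times inverse document frequency.
public final class TrigramVectorizer {

  // Document frequency of each trigram across the corpus.
  private final Map<String, Integer> docFreq = new HashMap<>();
  private int numDocs;

  public void fit(List<String> corpus) {
    numDocs = corpus.size();
    for (String s : corpus) {
      for (String t : trigrams(s).keySet()) {
        docFreq.merge(t, 1, Integer::sum);
      }
    }
  }

  // Sparse vector: trigram -> tf * idf weight.
  public Map<String, Double> transform(String s) {
    Map<String, Double> vector = new HashMap<>();
    for (Map.Entry<String, Integer> e : trigrams(s).entrySet()) {
      // Smoothed IDF so unseen trigrams don't divide by zero.
      double idf = Math.log((double) (numDocs + 1) / (docFreq.getOrDefault(e.getKey(), 0) + 1));
      vector.put(e.getKey(), e.getValue() * idf);
    }
    return vector;
  }

  private static Map<String, Integer> trigrams(String s) {
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i + 3 <= s.length(); i++) {
      counts.merge(s.substring(i, i + 3), 1, Integer::sum);
    }
    return counts;
  }
}

With this representation, 1.5 is no longer meaningless: any point in trigram space is a legitimate centroid, which is exactly the property the doubles-for-strings map loses.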