+1 Patch looks reasonable enough. You'd need to modify the other clustering algorithms to achieve uniformity.
The assumption about input seeds originally came from using Canopy to prime KMeans but it has become the prior set of clusters since the algorithms have converged on common formats & models. Each iteration reads in the set of clusters-n and outputs clusters-n+1, so changing this would have broad impact. FuzzyK and Dirichlet use the same iteration semantics and the ClusterIterator depends on this for unification with classification interfaces.

-----Original Message-----
From: Grant Ingersoll [mailto:[email protected]]
Sent: Wednesday, July 13, 2011 3:08 PM
To: [email protected]
Subject: Re: Emitting distance from centroid for K-Means

I put up a patch, do you think that it looks reasonable? I'm not totally thrilled by it, but it is a start.

On a related note, is there any reason why the input seeds can't be Vectors as an alternative to Cluster?

-Grant

On Jul 13, 2011, at 5:38 PM, Jeff Eastman wrote:

> Mostly. Clustering assigns points to one or more clusters, and it uses the
> distance measure or model pdf to do this. So the distance from each point to
> the cluster center is calculated in this step but thrown away once the
> assignment(s) is(are) made. This information could be output to another file
> or a different version could output the distance directly instead of the pdf.
> I don't know what that would mean for Dirichlet; however, since it only plays
> with pdf values.
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[email protected]]
> Sent: Wednesday, July 13, 2011 1:36 PM
> To: [email protected]
> Subject: Re: Emitting distance from centroid for K-Means
>
> Isn't --clustering the post processing step that already does it?
>
> On Jul 13, 2011, at 4:31 PM, Jeff Eastman wrote:
>
>> Well, distance is dependent upon the distance measure you want to use. A
>> post-processing step could easily calculate this. The ClusterEvaluator may
>> have some methods that could be useful. It calculates a set of
>> representative points for each cluster and calculates interCluster and
>> intraCluster densities from that.
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:[email protected]]
>> Sent: Wednesday, July 13, 2011 1:28 PM
>> To: [email protected]
>> Subject: Re: Emitting distance from centroid for K-Means
>>
>> Good to know. Next question, what's the preferred way, then, to get out
>> either the distance or what Ted said?
>>
>> -Grant
>>
>> On Jul 13, 2011, at 4:25 PM, Ted Dunning wrote:
>>
>>> I take back what I said.
>>>
>>> Jeff is correct.
>>>
>>> On Wed, Jul 13, 2011 at 1:23 PM, Jeff Eastman <[email protected]> wrote:
>>>
>>>> The weight is the probability the vector is a member of the cluster. For
>>>> FuzzyK and Dirichlet it is fractional, for KMeans it is 1 as the algorithm
>>>> is maximum likelihood and each point is only assigned to a single cluster.
>>>>
>>>> -----Original Message-----
>>>> From: Grant Ingersoll [mailto:[email protected]]
>>>> Sent: Wednesday, July 13, 2011 1:11 PM
>>>> To: [email protected]
>>>> Subject: Emitting distance from centroid for K-Means
>>>>
>>>> Does it make sense to output the distance to the cluster as the weight in
>>>> the KMeansClusterer.outputPointWithClusterInfo method instead of 1? What's
>>>> the purpose of the 1 as the weight?
>>>>
>>>> -Grant
>>
>> --------------------------
>> Grant Ingersoll
>
> --------------------------
> Grant Ingersoll

--------------------------
Grant Ingersoll
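
For readers following the thread, here is a minimal sketch of the post-processing idea discussed above: assign each point to its nearest centroid, KMeans-style, and keep the distance that would otherwise be thrown away. It deliberately uses plain double[] arrays rather than Mahout's Vector, Cluster or DistanceMeasure types. The class CentroidDistanceSketch and its methods are hypothetical names made up for illustration, and Euclidean distance is only an assumption standing in for whatever distance measure the clustering job actually used.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Mahout: hard assignment that also reports the distance. */
final class CentroidDistanceSketch {

    /** Euclidean distance; an assumed stand-in for the job's configured distance measure. */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the closest centroid, or -1 if no centroids are given. */
    static int nearest(double[] point, double[][] centroids) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centroids.length; k++) {
            double d = euclidean(point, centroids[k]);
            if (d < bestDistance) {
                bestDistance = d;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] points = {{3.0, 4.0}, {10.0, 9.0}};
        for (double[] p : points) {
            int k = nearest(p, centroids);
            double distance = euclidean(p, centroids[k]);
            // The weight stays 1.0 for a hard assignment; the distance is emitted alongside it.
            System.out.printf(Locale.ROOT, "cluster=%d weight=1.0 distance=%.3f%n", k, distance);
        }
    }
}
```

Emitting the distance as a separate field, rather than overwriting the weight, keeps the weight's meaning as a membership probability consistent with FuzzyK and Dirichlet, where it is fractional.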
