On Sat, Jan 9, 2010 at 9:18 AM, Grant Ingersoll <[email protected]> wrote:
> > In the past, I have used other more heuristic measures as well.  One of
> > the key characteristics that I would like to see out of a clustering is
> > a degree of stability.  Thus, I look at the fractions of points that are
> > assigned to each cluster
>
> So if we just spit out the percentage of points for each cluster, then
> someone could re-run with the original data plus the held out data, and
> those percentages should still be about the same, assuming the held-out
> data is randomly distributed in the space.
>

Spit out counts, of course, rather than percentages so that you can
distinguish small count anomalies.

> > or the distribution of distances from the cluster centroid.
>
> Likewise, we'd expect the held out points to be randomly distributed
> distance wise as well, right?
>

Not so much randomly as distributed randomly *the*same*way* for both test
and training.

--
Ted Dunning, CTO
DeepDyve
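
For concreteness, here is a minimal sketch of the two stability checks discussed above: per-cluster counts and the distribution of distances to the centroid, each compared between training and held-out points. It is only one possible way to do it. The function names are hypothetical, Euclidean distance is assumed, and the two-sample Kolmogorov-Smirnov statistic is an assumed choice for comparing the distance distributions; nothing in the thread prescribes it.

```python
import math
from collections import Counter


def nearest(point, centroids):
    """Return (cluster index, distance) for the closest centroid."""
    dists = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=dists.__getitem__)
    return idx, dists[idx]


def cluster_profile(points, centroids):
    """Per-cluster counts and per-cluster sorted centroid distances."""
    counts = Counter()
    distances = {k: [] for k in range(len(centroids))}
    for p in points:
        k, d = nearest(p, centroids)
        counts[k] += 1
        distances[k].append(d)
    for ds in distances.values():
        ds.sort()
    return counts, distances


def expected_counts(train_counts, n_test):
    """Held-out counts expected if test points follow training proportions."""
    n_train = sum(train_counts.values())
    return {k: n_test * c / n_train for k, c in train_counts.items()}


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    if not a or not b:
        return float("nan")
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both CDFs past every value equal to x, then compare.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


if __name__ == "__main__":
    # Toy data, purely illustrative.
    centroids = [(0.0, 0.0), (10.0, 10.0)]
    train = [(0, 1), (1, 0), (0, -1), (10, 11), (11, 10), (9, 10)]
    test = [(-1, 0), (10, 9)]

    train_counts, train_dist = cluster_profile(train, centroids)
    test_counts, test_dist = cluster_profile(test, centroids)
    expected = expected_counts(train_counts, len(test))

    for k in range(len(centroids)):
        print(k, test_counts[k], expected[k],
              ks_statistic(train_dist[k], test_dist[k]))
```

Reporting the raw held-out count next to the expected count, rather than only a percentage, keeps small-count clusters visible: a cluster that holds 2 points where 1 was expected reads very differently from one that holds 2000 where 1000 were expected.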
