On Sat, Jan 9, 2010 at 9:18 AM, Grant Ingersoll <[email protected]> wrote:

> > In the past, I have used other more heuristic measures as well.  One of
> > the key characteristics that I would like to see out of a clustering is
> > a degree of stability.  Thus, I look at the fractions of points that are
> > assigned to each cluster
>
> So if we just spit out the percentage of points for each cluster, then
> someone could re-run with the original data plus the held out data, and
> those percentages should still be about the same, assuming the held-out data
> is randomly distributed in the space.
>

Spit out counts, of course, rather than percentages so that you can
distinguish small count anomalies.
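
As a rough sketch of what reporting counts could look like (plain Python,
not Mahout code; the function names and the toy data are made up for
illustration, and assignment by nearest centroid under Euclidean distance
is an assumption):

```python
import math
from collections import Counter

def nearest_cluster(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def cluster_counts(points, centroids):
    """Raw number of points assigned to each cluster, empty clusters included."""
    counts = Counter(nearest_cluster(p, centroids) for p in points)
    return [counts.get(i, 0) for i in range(len(centroids))]

# Hypothetical toy data: two centroids, a training set and a held-out set.
centroids = [(0.0, 0.0), (10.0, 10.0)]
train = [(0.1, 0.2), (0.3, -0.1), (9.8, 10.1), (10.2, 9.9)]
held_out = [(0.0, 0.4), (10.1, 10.0)]

train_counts = cluster_counts(train, centroids)
held_out_counts = cluster_counts(held_out, centroids)

# Print raw counts side by side; a reader can still derive percentages,
# but a cluster holding only a handful of points stays visible.
for i, (a, b) in enumerate(zip(train_counts, held_out_counts)):
    print(f"cluster {i}: train={a} held_out={b}")
```

With counts in hand, a cluster that holds 3 of 1000 points is easy to tell
apart from one that holds 3 of 10, which a percentage alone would hide.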


>
> > or the distribution of distances from the cluster centroid.
>
> Likewise, we'd expect the held out points to be randomly distributed
> distance wise as well, right?
>

Not so much randomly distributed as distributed *the same way* for both
test and training.
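
One possible way to check "the same way" is to compare the two empirical
distributions of distance-to-centroid directly. The sketch below uses the
two-sample Kolmogorov-Smirnov statistic (the largest gap between the two
empirical CDFs). That choice is an assumption for illustration only, as are
the function names; any two-sample comparison would serve the same purpose.

```python
import math
from bisect import bisect_right

def centroid_distances(points, centroid):
    """Sorted Euclidean distances from each point to the given centroid."""
    return sorted(math.dist(p, centroid) for p in points)

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two sorted, non-empty samples.

    0.0 means the empirical distributions coincide; 1.0 means the samples
    do not overlap at all.
    """
    gap = 0.0
    # Both CDFs are step functions that only change at sample values,
    # so checking every sample value finds the maximum gap.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap

# Hypothetical usage for one cluster:
#   train_d = centroid_distances(train_points_in_cluster, centroid)
#   test_d = centroid_distances(held_out_points_in_cluster, centroid)
#   print(ks_statistic(train_d, test_d))
```

A small value for every cluster suggests the held-out points sit around
the centroids much as the training points do; a large value for some
cluster is a hint that the clustering is not stable there.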


-- 
Ted Dunning, CTO
DeepDyve
