Take, for example, the following two measures. The first takes the distance
between cluster centers as the cluster distance, and the second takes the
minimal pairwise distance between elements of the two clusters.
If you use the first measure, two clusters can almost overlap without the
measure showing it. Say you have two clusters of strings, {xAAA, yAAA, zAAA}
and {xBBB, uBBB, uBBB} (where the string distance measure is based on identity
of letters). The distance between the means of the two clusters might be big,
because the three-letter block in each string differs between the two
clusters. But the strings xAAA and xBBB are quite similar, because they
share a letter (which is also at the same position in each string). So the
minimal distance between elements of those clusters could be quite small,
which is something you may want to avoid. This could be a case in which you
choose the second distance measure and impose a minimal value as a threshold
for the clustering.
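
Here is a rough sketch of the two measures on the strings above. It assumes
position-wise identity (Hamming distance) as the string distance and uses
per-position letter frequencies as a stand-in for the cluster "mean". Both
choices, and all function names, are my own for illustration; none of this
is a Mahout class.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def letter_profile(cluster):
    """Per-position letter frequencies (assumed stand-in for the cluster mean)."""
    n = len(cluster)
    return [
        {letter: count / n for letter, count in Counter(column).items()}
        for column in zip(*cluster)
    ]


def center_distance(c1, c2):
    """First measure: distance between the two cluster profiles.

    Sums, over positions, half the L1 difference of the letter frequencies.
    For single-element clusters this equals the Hamming distance.
    """
    total = 0.0
    for p, q in zip(letter_profile(c1), letter_profile(c2)):
        letters = set(p) | set(q)
        total += 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in letters)
    return total


def min_pairwise_distance(c1, c2):
    """Second measure: smallest distance between any two elements."""
    return min(hamming(a, b) for a, b in product(c1, c2))


cluster_a = ["xAAA", "yAAA", "zAAA"]
cluster_b = ["xBBB", "uBBB", "uBBB"]

print("center distance:      ", center_distance(cluster_a, cluster_b))
print("min pairwise distance:", min_pairwise_distance(cluster_a, cluster_b))
```

With these assumptions the minimal pairwise distance comes from the pair
xAAA / xBBB and is smaller than the center distance, so a threshold on the
second measure is the stricter test of whether two clusters are too close.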
fx
2010/1/7 Sean Owen <[email protected]>
> Yes, I mean, it's possible to explain what each of the algorithms
> does, both formally and intuitively. But it's hard to explain in which
> cases one metric might be more desirable than another. Yes, in a sense
> they all should do the same thing -- define a consistent distance
> metric between vectors. But they're different metrics.
>
> Even I don't try to think too hard about which one is best. I just try
> them all when trying to fit the best algorithm to a set of data. So
> it's still good to have different implementations.
>
> On Thu, Jan 7, 2010 at 12:51 PM, Bogdan Vatkov <[email protected]>
> wrote:
> > I see but I was looking for more practical definition - e.g. if I use one or
> > another distance measure class what would be the effect.
> > The mathematical explanations in the javadoc are not helping much.
> > If there is no way to explain different algorithms for distance in more
> > practical way then maybe we do not need different algorithms :)
> > - e.g. is the distance affected more by the number of common terms or the
> > weights of common terms or ... - this is just a possible example, I do not
> > know if it matches any of the distance algorithms.
> > there should be a guidance for the ones that will use the stuff - it is
> > expected that these users know something about their input data and based on
> > different characteristics of that data (e.g. number of docs, doc size, etc.)
> > and desired result (e.g. number of clusters, number of unique term in
> > clusters, etc.) to be able to pick the right Mahout configuration - with
> > regards to numbers, classes, algorithms, etc.
> > I currently miss such a guideline.
>