Re: Cluster distance

Grant Ingersoll Thu, 07 Jan 2010 05:20:57 -0800

On Jan 7, 2010, at 7:51 AM, Bogdan Vatkov wrote:

> I see, but I was looking for a more practical definition, e.g. if I use one
> or another distance measure class, what would the effect be?
> The mathematical explanations in the javadoc are not helping much.
> If there is no way to explain the different distance algorithms in a more
> practical way, then maybe we do not need different algorithms :)
> E.g. is the distance affected more by the number of common terms or by the
> weights of common terms or ...? This is just a possible example; I do not
> know if it matches any of the distance algorithms.
> There should be guidance for those who will use this stuff. It is
> expected that these users know something about their input data and, based on
> different characteristics of that data (e.g. number of docs, doc size, etc.)
> and the desired result (e.g. number of clusters, number of unique terms in
> clusters, etc.), are able to pick the right Mahout configuration with
> regard to numbers, classes, algorithms, etc.
> I currently miss such a guideline.

Typically, the source of the data dictates the measure.  AIUI, text is best
represented by normalizing the vectors with a 1-norm or 2-norm and then using
the matching distance measure (Manhattan for the 1-norm, Euclidean or Cosine
for the 2-norm).  Some of the other measures are better suited to other kinds
of data, but I don't have a good sense for them yet.  I've gotten decent to
good results (not formally validated) on news text using Cosine and vectors
normalized using the 2-norm.  As Ted said to me the other day, though, K-Means
in particular is fairly robust even if you aren't strict about matching the
normalization with the distance measure.
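
To make the practical difference concrete, here is a minimal sketch in plain
Java.  It does not use Mahout's DistanceMeasure classes; the class name
DistanceSketch, its methods, and the three term-weight vectors are made up for
illustration.  It compares a short document, a ten-times-longer document with
the same term proportions, and a document with a different term mix.

```java
// Illustrative sketch only: plain Java, not Mahout's distance classes.
// All names and vectors here are hypothetical.
class DistanceSketch {

    // Manhattan (L1) distance: sum of absolute per-term differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Euclidean (L2) distance: square root of summed squared differences.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Cosine distance: 1 - cos(angle). Ignores vector length entirely.
    // Zero vectors are not handled (they would divide by zero).
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scale a vector to unit length under the 2-norm.
    static double[] normalizeL2(double[] v) {
        double norm = 0.0;
        for (double x : v) {
            norm += x * x;
        }
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical term weights over the same 3-term vocabulary.
        double[] shortDoc = {1, 2, 0};
        double[] longDoc = {10, 20, 0};  // same proportions, 10x the weight
        double[] otherDoc = {0, 2, 1};   // similar length, different terms

        System.out.println("raw Manhattan  short-long:  " + manhattan(shortDoc, longDoc));
        System.out.println("raw Euclidean  short-long:  " + euclidean(shortDoc, longDoc));
        System.out.println("raw Euclidean  short-other: " + euclidean(shortDoc, otherDoc));
        System.out.println("cosine         short-long:  " + cosineDistance(shortDoc, longDoc));
        System.out.println("cosine         short-other: " + cosineDistance(shortDoc, otherDoc));

        // After 2-norm normalization, Euclidean agrees with cosine on ordering.
        double[] s = normalizeL2(shortDoc);
        double[] l = normalizeL2(longDoc);
        double[] o = normalizeL2(otherDoc);
        System.out.println("unit Euclidean short-long:  " + euclidean(s, l));
        System.out.println("unit Euclidean short-other: " + euclidean(s, o));
    }
}
```

On raw weights, Manhattan and Euclidean are dominated by document length, so
the short document looks closer to the unrelated one than to its longer twin.
Cosine looks only at the proportions of the weights, so the twins come out at
(nearly) zero distance.  Once the vectors are 2-norm normalized, squared
Euclidean distance equals twice the cosine distance, so the two measures rank
neighbours identically.  That may be part of why the exact pairing matters
less than one might expect.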

I do think, though, that a lot of it comes down to trial and error with your
data.  We are working on some scripts to make this a lot easier.  One of the
things that we need to do is build up benchmarks with common collections (see
the Open Relevance Project under Lucene) so that people can make comparisons
and see how this all works.

I'm sure others can chime in with more of their experience.

> 
> On Thu, Jan 7, 2010 at 2:38 PM, Felix Lange <[email protected]> wrote:
> 
>> Hi Bogdan,
>> I didn't read any javadocs about this package, but the cluster distance
>> should be the distance between two clusters. There are different distance
>> measures in this respect, e.g. you can take the distance between two
>> clusters' centers as their distance value.
>> Greetings
>> Felix
>> 
>> 
>> 2010/1/6 Bogdan Vatkov <[email protected]>
>> 
>>> What is the practical meaning of the "cluster distance"? E.g. I am
>>> currently using org.apache.mahout.common.distance.CosineDistanceMeasure
>>> but I do not have any clue what that means and what other values could
>>> bring to the game. Any guidance here?
>>> 
>>> --
>>> Best regards,
>>> Bogdan
>>> 
>> 
> 
> 
> 
> -- 
> Best regards,
> Bogdan
