On Jan 7, 2010, at 7:51 AM, Bogdan Vatkov wrote:

> I see but I was looking for more practical definition - e.g. if I use one or
> another distance measure class what would be the effect.
> The mathematical explanations in the javadoc are not helping much.
> If there is no way to explain different algorithms for distance in more
> practical way then maybe we do not need different algorithms :)
> - e.g. is the distance affected more by the number of common terms or the
> weights of common terms or ... - this is just a possible example, I do not
> know if it matches any of the distance algorithms.
> there should be a guidance for the ones that will use the stuff - it is
> expected that these users know something about their input data and based on
> different characteristics of that data (e.g. number of docs, doc size, etc.)
> and desired result (e.g. number of clusters, number of unique term in
> clusters, etc.) to be able to pick the right Mahout configuration - with
> regards to numbers, classes, algorithms, etc.
> I currently miss such a guideline.

Typically, the source of the data dictates the measures, etc. AIUI, text is best represented by normalizing the vectors with the 1-norm or 2-norm and then using the matching distance measure (Manhattan for the 1-norm, Euclidean or Cosine for the 2-norm). Some of the other measures are better suited to other kinds of data, but I don't have a good sense for them yet. I've gotten decent to good results (not formally validated) on news text using Cosine and vectors normalized using the 2-norm. As Ted said to me the other day, though, K-Means, in particular, is fairly robust even if you aren't strict about matching the normalization w/ the distance measure.

I do think, though, that a lot of it comes down to trial and error with your data. We are working on some scripts to make this a lot easier. One of the things that we need to do is build up benchmarks w/ common collections (see the Open Relevance Project under Lucene) so that people can make comparisons and see how this all works.

I'm sure others can chime in w/ more of their experience.

>
> On Thu, Jan 7, 2010 at 2:38 PM, Felix Lange <[email protected]> wrote:
>
>> Hi Bodgan,
>> I didn't read any javadocs about this package, but the cluster distance
>> should be the distance between two clusters. There are different distance
>> measures in this respect, e.g. you can take the distance between two
>> clusters' centers as their distance value.
>> Greetings
>> Felix
>>
>>
>> 2010/1/6 Bogdan Vatkov <[email protected]>
>>
>>> What is the practical meaning of the "cluster distance" e.g. I am currently
>>> using org.apache.mahout.common.distance.CosineDistanceMeasure but I do not
>>> have any clue what does that mean and what other values could bring to the
>>> game. Any guidance here?
>>>
>>> --
>>> Best regards,
>>> Bogdan
>>>
>>
>
>
>
> --
> Best regards,
> Bogdan
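
To give a practical feel for Manhattan, Euclidean and Cosine on text, here is a rough sketch in plain Python. It is not Mahout code: the function names and the toy term weights are made up for illustration, and real term vectors would be much larger and sparse. It compares a short document, a longer document with the same mix of terms, and a document that shares only one term with the short one.

```python
import math

def manhattan(a, b):
    # Sum of absolute per-term differences (pairs with the 1-norm).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance (pairs with the 2-norm).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cos(angle); depends on the mix of terms, not on vector length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def normalize(v, p):
    # Scale v so that its p-norm is 1 (p=1 or p=2 here).
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return [x / norm for x in v]

# Hypothetical term weights over a four-term vocabulary.
short_doc = [2.0, 1.0, 0.0, 0.0]
long_doc = [6.0, 3.0, 0.0, 0.0]   # same term mix, three times the weight
other_doc = [0.0, 1.0, 2.0, 0.0]  # shares only the second term

measures = [("manhattan", manhattan),
            ("euclidean", euclidean),
            ("cosine", cosine_distance)]

for label, docs in [("raw", (short_doc, long_doc, other_doc)),
                    ("2-norm", tuple(normalize(d, 2)
                                     for d in (short_doc, long_doc, other_doc)))]:
    s, l, o = docs
    for name, measure in measures:
        print(f"{label:7s} {name:10s} short-long={measure(s, l):.3f} "
              f"short-other={measure(s, o):.3f}")
```

On the raw vectors, Euclidean and Manhattan put the short document closer to the unrelated one than to its longer twin (about 2.83 vs 4.12 for Euclidean), because document length dominates. Cosine sees the short and long documents as essentially identical. After 2-norm normalization the Euclidean ordering agrees with Cosine, which is the reason for matching the normalization to the measure.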
