Yes, it's possible to explain what each of the algorithms does, both formally and intuitively. But it's hard to explain in which cases one metric is more desirable than another. In a sense they should all do the same thing: define a consistent distance metric between vectors. But they're different metrics.
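A small sketch may make "different metrics" concrete. It uses plain-Python versions of the textbook Euclidean, Manhattan, cosine and Tanimoto formulas. These are not Mahout's classes and may differ from them in detail, and the three term-weight vectors are made up for illustration.

```python
import math


def _dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance; sensitive to the size of the weights."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-term differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    """1 - cos(angle); ignores vector length, looks only at direction."""
    return 1.0 - _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)))


def tanimoto(a, b):
    """1 - a.b / (|a|^2 + |b|^2 - a.b); mixes overlap and magnitude."""
    ab = _dot(a, b)
    return 1.0 - ab / (_dot(a, a) + _dot(b, b) - ab)


# Hypothetical term-weight vectors over a four-term vocabulary.
query = [1.0, 1.0, 0.0, 0.0]
candidates = {
    # Same two terms as the query, but with much larger weights.
    "same_terms_heavier": [10.0, 10.0, 0.0, 0.0],
    # Similar weights, but only one term in common with the query.
    "different_terms": [1.0, 0.0, 1.0, 0.0],
}


def nearest(measure):
    """Name of the candidate that the given measure puts closest to query."""
    return min(candidates, key=lambda name: measure(query, candidates[name]))


if __name__ == "__main__":
    for measure in (euclidean, manhattan, cosine, tanimoto):
        print(measure.__name__, "->", nearest(measure))
```

On this toy data, cosine picks the vector with the same terms and heavier weights, while Euclidean, Manhattan and Tanimoto pick the one with similar weights and fewer common terms. That is only one contrived case, not a rule for choosing a measure.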
Even I don't try to think too hard about which one is best. I just try them all when trying to fit the best algorithm to a set of data. So it's still good to have different implementations.

On Thu, Jan 7, 2010 at 12:51 PM, Bogdan Vatkov <[email protected]> wrote:
> I see but I was looking for more practical definition - e.g. if I use one or
> another distance measure class what would be the effect.
> The mathematical explanations in the javadoc are not helping much.
> If there is no way to explain different algorithms for distance in more
> practical way then maybe we do not need different algorithms :)
> - e.g. is the distance affected more by the number of common terms or the
> weights of common terms or ... - this is just a possible example, I do not
> know if it matches any of the distance algorithms.
> there should be a guidance for the ones that will use the stuff - it is
> expected that these users know something about their input data and based on
> different characteristics of that data (e.g. number of docs, doc size, etc.)
> and desired result (e.g. number of clusters, number of unique term in
> clusters, etc.) to be able to pick the right Mahout configuration - with
> regards to numbers, classes, algorithms, etc.
> I currently miss such a guideline.
