Re: Distance calculation performance issue

nfantone Thu, 30 Jul 2009 06:16:39 -0700

> Agreed, can you put your changes up as a patch on MAHOUT-121?  That way we 
> can do file diffs, etc.


I'm willing to do so. But, then again, I wasn't sure whether what I
changed was correct or whether I had simply misinterpreted the distance
calculation method. After reading what Shashinkant suggested, I'll try
to make a patch.

> --input ../nfantone/user.data --clusters ../nfantone/output/clusters --k 10 
> --output ../content/nfantone/output/ --convergence 0.01 --overwrite

I see you chose k=10, as opposed to 200. That could explain our differences.

> We took the idea of optimized distance calculation from LingPipe. I
> suggest you read this, as I won't be able to communicate the idea as
> crisply.  After reading this post, you will be able to relate it to
> the code.
>
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/

I'll read this. I have a feeling I misunderstood the distance process.
I'll comment soon.

> Yes, some of the method names are misnomers (or downright misleading)
> as there is no good term to describe the intermediate values.

Surely we can think of another name for that particular method. A name
may be imperfect, but it shouldn't refer to some OTHER thing that the
method doesn't do.

>> SquaredEuclideanDistanceMeasure.java, currently, does the following:
>>
>>    if (centroid.size() != v.size()) {
>>      throw new CardinalityException();
>>    }
>>
>>    double result = centroidLengthSquare;
>>    result += v.getDistanceSquared(centroid);
>>    return centroidLengthSquare + v.getDistanceSquared(centroid);
>
> Here, we don't call v.getDistanceSquared(centroid) again (which would be
> redundant). It simply returns the result calculated in the previous step.

You are completely right. My mistake. While refactoring, it seems I
modified those lines and forgot to restore them before writing my
post. That redundant call is not in the original code. However, I fail
to see the need to declare a double, add to it with +=, and then return
it. This should simply be:

if (centroid.size() != v.size()) {
    throw new CardinalityException();
}
return centroidLengthSquare + v.getDistanceSquared(centroid);


> With the LingPipe optimization, the number of calculations equals the
> number of non-zero features in only one vector. You don't need to
> iterate over the centroid vector for every distance calculation.
>
> Let me know if I am off the mark here.

You aren't. Again, I think I got the algebra wrong. I'll let you know.
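
For reference, the identity behind that kind of optimization is
||x - c||^2 = ||c||^2 + ||x||^2 - 2(x . c). If ||c||^2 is computed once
per centroid, the remaining two terms only involve the non-zero entries
of x. Below is a minimal sketch of that idea, assuming a dense centroid
stored as a double[] and a sparse vector stored as a Map from index to
value. The class and method names are hypothetical and are not Mahout's
or LingPipe's API; the naive method is there only to check the result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Mahout's or LingPipe's actual classes.
final class SparseDistanceSketch {

    // ||c||^2 for a dense centroid; meant to be computed once and reused.
    static double lengthSquared(double[] centroid) {
        double sum = 0.0;
        for (double c : centroid) {
            sum += c * c;
        }
        return sum;
    }

    // ||x - c||^2 = ||c||^2 + sum over non-zero x_i of (x_i^2 - 2 * x_i * c_i).
    // Only the non-zero entries of the sparse vector are visited.
    static double distanceSquared(double centroidLengthSquared,
                                  double[] centroid,
                                  Map<Integer, Double> sparse) {
        double result = centroidLengthSquared;
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            double xi = e.getValue();
            double ci = centroid[e.getKey()];
            result += xi * xi - 2.0 * xi * ci;
        }
        return result;
    }

    // Straightforward version that walks every dimension, for comparison.
    static double naiveDistanceSquared(double[] centroid,
                                       Map<Integer, Double> sparse) {
        double sum = 0.0;
        for (int i = 0; i < centroid.length; i++) {
            double diff = sparse.getOrDefault(i, 0.0) - centroid[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] centroid = {1.0, 2.0, 0.0, 4.0};
        Map<Integer, Double> v = new HashMap<>();
        v.put(1, 3.0); // dense form of v is (0, 3, 0, 1)
        v.put(3, 1.0);

        double centroidLengthSquared = lengthSquared(centroid);
        // Both lines print the same squared distance for this input.
        System.out.println(distanceSquared(centroidLengthSquared, centroid, v));
        System.out.println(naiveDistanceSquared(centroid, v));
    }
}
```

The saving comes from reusing centroidLengthSquared across all points
compared against the same centroid, so each distance costs work
proportional to the non-zeros of the point rather than to the full
cardinality.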
