I'll check things out tomorrow at work, make the patch, and let you know.
Thanks, Grant.

On Tue, Jul 28, 2009 at 5:29 PM, Grant Ingersoll<[email protected]> wrote:
> Funny, I did this same thing on the plane by revisiting Shashi's patch on
> MAHOUT-121 and properly applying it to K-Means (I missed one critical line
> that uses the centroid based distance method instead of the (Vector, Vector)
> one).  I think your data set ran, for 10 iterations, in just over 2 minutes
> and that was with the profiler hooked up, too.
>
> I will post a patch on MAHOUT-121, can you take a look?  Also, if you can
> post your changes as a patch, then we can likely compare/merge, etc. and
> come up with the best way of doing all of this.
>
> Thanks for the insight/analysis.
>
> -Grant
>
>
> On Jul 28, 2009, at 10:16 AM, nfantone wrote:
>
>> Ok, this post is going to be a long one, and so it deserves its own
>> thread. My apologies beforehand.
>>
>> Here's what I have tried to ease the distance calculation problem. I
>> know it's quite nasty, but I wanted to ensure our bottleneck was
>> there, in the euclidean distance computation.
>>
>> But first, a little excursus:
>>
>> /* EXCURSUS */
>> The "optimized" version of distance() for SparseVectors in
>> SquaredEuclideanDistanceMeasure.java currently does the following:
>>
>>   if (centroid.size() != v.size()) {
>>     throw new CardinalityException();
>>   }
>>
>>   double result = centroidLengthSquare;
>>   result += v.getDistanceSquared(centroid);
>>   return centroidLengthSquare + v.getDistanceSquared(centroid);
>>
>> It expects a vector's squared length, which is then added to what
>> getDistanceSquared() returns. However, that method computes this
>> inside a while loop (for the comments, assume two 2D vectors [X0, X1]
>> and [Y0, Y1]):
>>
>>     elt = iter.next();
>>     centroidValue = v.getQuick(elt.index());
>>     delta = elt.get() - centroidValue;
>>     result += (delta * delta) - (centroidValue * centroidValue);
>>     // This is (X0 - Y0)*(X0 - Y0) - Y0*Y0 + (X1 - Y1)*(X1 - Y1) - Y1*Y1;
>>     // I don't have the slightest idea what to call this value.
>>
>> I certainly couldn't understand the reason behind that (centroidValue
>> * centroidValue) subtraction. Being called getDistanceSquared, the
>> method should return just that... and that is the value one gets by
>> just ignoring that subtraction, that is:
>>
>> ...
>> result += (delta * delta); //This is (X0 - Y0)*(X0 - Y0) + (X1 -
>> Y1)*(X1 - Y1); the ACTUAL squared distance.
>> ...
>>
>> Moreover, the sum of every (centroidValue * centroidValue) in each
>> iteration (Y0*Y0 + Y1*Y1, in the examples) corresponds to
>> centroidLengthSquare, which was previously calculated before calling
>> distance(centroidLengthSquare, centroid, v) and then added to the
>> result. So, to sum up: first, a certain length is calculated (the
>> squared norm of what the signature from distance() calls "centroid");
>> second, that SAME value gets calculated again in another method and
>> subtracted from a certain total; last, those values cancel each other
>> by summing both totals, leaving just the wanted squared distance,
>> here:
>>
>> return centroidLengthSquare + v.getDistanceSquared(centroid);
>>
>> Which could be re-written as:
>>
>> return centroidLengthSquare + squaredDistanceBetweenCentroidAndV -
>> centroidLengthSquare;
>>
>> Maybe this has been done on purpose, though that purpose eludes me for
>> now.
>>
>> /* END EXCURSUS */
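To make the cancellation above concrete, here is a minimal, self-contained sketch (plain double[] arrays instead of Mahout's Vector, and class/method names of my own invention) showing that the ||centroid||^2 term and the per-component subtraction cancel exactly:

```java
// Sketch illustrating the cancellation described in the excursus:
// ||y||^2 plus the accumulated (delta^2 - y[i]^2) terms equals the
// plain squared Euclidean distance. Dense arrays, illustrative names.
class CancellationSketch {
    // Plain squared Euclidean distance: sum of (x[i] - y[i])^2.
    static double squaredDistance(double[] x, double[] y) {
        double result = 0.0;
        for (int i = 0; i < x.length; i++) {
            double delta = x[i] - y[i];
            result += delta * delta;
        }
        return result;
    }

    // Mahout-style computation: start from ||y||^2 (the precomputed
    // centroidLengthSquare), then accumulate (delta^2 - y[i]^2).
    static double mahoutStyleDistance(double[] x, double[] y) {
        double centroidLengthSquare = 0.0;
        for (double yi : y) {
            centroidLengthSquare += yi * yi;
        }
        double result = centroidLengthSquare;
        for (int i = 0; i < x.length; i++) {
            double delta = x[i] - y[i];
            result += (delta * delta) - (y[i] * y[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        double[] y = {3.0, 5.0};
        System.out.println(squaredDistance(x, y));      // 13.0
        System.out.println(mahoutStyleDistance(x, y));  // 13.0
    }
}
```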
>>
>> And now, the fun part: code disrupting. Here are the changes I applied
>> (remember: just for testing performance's sake). I left the changed
>> bits commented.
>>
>> EuclideanDistanceMeasure.java
>> Renamed distance(double centroidLengthSquare, Vector centroid, Vector
>> v) to sparseDistance and eliminated the first param. We don't need the
>> sqrt to compare real distances for emitting points to clusters (but we
>> do need it to compute a cluster's convergence), so I took that out,
>> as well (I know the whole point of this class is to sqrt the results
>> from SquaredEuclideanDistance, but I needed the distance function
>> that's compromising performance to do a minimal set of calculations).
>>
>> @Override
>>  public double sparseDistance(/*double centroidLengthSquare,*/ Vector
>> centroid, Vector v) {
>>   return /*Math.sqrt(*/super.sparseDistance(centroid, v)/*)*/;
>>  }
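As to why dropping the sqrt is safe for emitting points: sqrt is monotonically increasing, so whichever center minimizes the squared distance also minimizes the true distance. A small self-contained sketch (plain arrays; names are mine, not Mahout's):

```java
// Sketch showing why sqrt can be skipped when picking the nearest
// cluster: argmin over squared distances equals argmin over true
// distances, because sqrt is monotonic. Illustrative names only.
class NearestSketch {
    static double squaredDistance(double[] x, double[] y) {
        double result = 0.0;
        for (int i = 0; i < x.length; i++) {
            double delta = x[i] - y[i];
            result += delta * delta;
        }
        return result;
    }

    // Index of the nearest center, comparing squared distances only.
    static int nearestBySquared(double[][] centers, double[] point) {
        int nearest = -1;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double d = squaredDistance(centers[i], point);
            if (d < best) {
                best = d;
                nearest = i;
            }
        }
        return nearest;
    }

    // Same loop taking the sqrt first; must pick the same index.
    static int nearestByTrue(double[][] centers, double[] point) {
        int nearest = -1;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double d = Math.sqrt(squaredDistance(centers[i], point));
            if (d < best) {
                best = d;
                nearest = i;
            }
        }
        return nearest;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}, {4.0, 4.0}};
        double[] point = {3.0, 3.0};
        System.out.println(nearestBySquared(centers, point)); // 2
        System.out.println(nearestByTrue(centers, point));    // 2
    }
}
```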
>>
>>
>> SquaredEuclideanDistanceMeasure.java
>> Rewrote and renamed the implementation of distance(double
>> centroidLengthSquare, Vector centroid, Vector v). Ignored size check
>> (again: just for the benefit of performance). Just return the result
>> of getDistanceSquared().
>>
>> @Override
>>  public double sparseDistance(/*double centroidLengthSquare,*/ Vector
>> centroid, Vector v) {
>> /*    if (centroid.size() != v.size()) {
>>     throw new CardinalityException();
>>   }
>>
>>   double result = centroidLengthSquare;
>>  result += v.getDistanceSquared(centroid);
>> */ return /*centroidLengthSquare +*/ v.getDistanceSquared(centroid);
>>  }
>>
>> SparseVector.java
>> Refactor of getDistanceSquared(). Variables should not be created in a
>> loop if there's no need to do so. Eliminated the centroidValue^2
>> calculation in each iteration.
>>
>> @Override
>>  public double getDistanceSquared(Vector v) {
>>   //TODO: Check sizes?
>>
>>   double result = 0.0;
>>   Iterator<Vector.Element> iter = iterateNonZero();
>>   Vector.Element elt;
>>   double delta;
>>
>>   while (iter.hasNext()) {
>>     elt = iter.next();
>>     delta = elt.get() - v.getQuick(elt.index());
>>     result += (delta * delta); //- (centroidValue * centroidValue);
>>   }
>>   return result;
>>  }
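For comparison, here is a self-contained sketch of the same squared-distance computation over a map-based sparse representation (my own types and names, not Mahout's). Note that it also visits indices where only the other vector is non-zero; a one-sided loop like the one above is equivalent only when the iterated vector's non-zero pattern covers the other's:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of squared Euclidean distance over a map-based sparse
// representation (index -> value). Visits the non-zeros of BOTH
// vectors, which is the fully general case. Illustrative names only.
class SparseSketch {
    static double getDistanceSquared(Map<Integer, Double> a, Map<Integer, Double> b) {
        double result = 0.0;
        // Terms at indices where a is non-zero.
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            double delta = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
            result += delta * delta;
        }
        // Terms at indices where only b is non-zero (a is zero there).
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                result += e.getValue() * e.getValue();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> a = new HashMap<>();
        a.put(0, 1.0);
        a.put(2, 2.0);
        Map<Integer, Double> b = new HashMap<>();
        b.put(0, 3.0);
        b.put(5, 4.0);
        // (1-3)^2 + 2^2 + 4^2 = 4 + 4 + 16 = 24
        System.out.println(getDistanceSquared(a, b)); // 24.0
    }
}
```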
>>
>> Cluster.java
>> Refactor of emitPointToNearestCluster(). null comparison eliminated
>> (not necessary anymore). sparseDistance() is now called instead.
>>
>> ...
>>  Cluster nearestCluster = null;
>>   double nearestDistance = Double.MAX_VALUE;
>>   double distance = 0;
>>   for (Cluster cluster : clusters) {
>>     distance = measure.sparseDistance(cluster.getCenter(), point);
>>     if (distance < nearestDistance) {
>>       nearestCluster = cluster;
>>       nearestDistance = distance;
>>     }
>>   }
>> ...
>>
>> Added a Math.sqrt call to computeConvergence(), which now uses
>> sparseDistance().
>>
>>  public boolean computeConvergence() {
>>   Vector centroid = computeCentroid();
>>   converged =
>> Math.sqrt(measure.sparseDistance(/*centroid.getLengthSquared(),*/
>> centroid, center)) <= convergenceDelta;
>>   return converged;
>>  }
>>
>>
>> After all those modifications, kMeans now runs noticeably faster.
>> Running the whole thing locally (in a pseudo-distributed cluster),
>> every iteration takes me about 7 minutes to complete using the
>> very same dataset I uploaded and posted some days ago, which used to
>> take 3 hours. Again, this is not meant to be a final, closed solution
>> at all. But, I think it could very well serve as a first step towards
>> that.
>
>
