On Wed, Jan 27, 2010 at 11:35 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> That does look wrong.

Well, that's problematic, because that code is used throughout all of our
clusterers (KMeans, Canopy, for starters).

> The origin of the optimization is that (x-y)^2 = x^2 + y^2 - 2 x.y =
> x^2 + y.(y - 2x). If x^2 is pre-computed and y is sparse, then this
> second form can be much faster than the first.

This part is totally true, but does not appear to be what is implemented.

> Whether the third form is better still depends on implementation, but if
> it is done without an extra vector allocation, then it should be better
> because it only iterates through y once.
>
> x^2 + y.(y - 2x)

Hmm... how exactly do you do both a dot product and a subtraction in one
iteration? I guess if x is dense,

  for (Vector.Element e : y) {
    result += e.get() * (e.get() - 2 * x.getQuick(e.index()));
  }

Ok. Of course, if y is caching its lengthSquared already, you only need to
do the iteration once anyway, and also don't need to do a subtraction inside
the inner loop.

Of course, just calling vector.getDistanceSquared(other) allows the vector
impl to take care of these optimizations itself. Unfortunately, that only
handles SquaredEuclideanDistanceMeasure. (On the other hand, the whole idea
of optimizing via (x-y)^2 = x^2 + y^2 - 2 x.y depends completely on the
distance measure being defined in terms of dot products, and doesn't apply
in the general case [Manhattan, Tanimoto, Cosine].)

It seems the better way to do this would be to call MAHOUT-209 methods like

  aggregate(Vector v, BinaryFunction aggregator, BinaryFunction combiner)

with combiner being plus, and aggregators with impls pulled out of the
DistanceMeasure versions. Of course, now that I think of it, MAHOUT-209
would need to be extended to allow a version like

  aggregate(double initialValue, Vector v, BinaryFunction agg, BinaryFunction c);

which allows you to set the initial value to something other than zero.

-jake
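
P.S. For concreteness, here's roughly the single-iteration version I have in
mind when both vectors cache their lengthSquared. This assumes the current
Vector API (getLengthSquared(), iterateNonZero(), getQuick()); treat it as a
sketch, not tested code:

  // (x - y)^2 = x^2 + y^2 - 2 x.y: with both lengthSquared values cached,
  // the only per-element work left is the dot product over y's non-zeros.
  public static double distanceSquared(Vector x, Vector y) {
    double dot = 0.0;
    Iterator<Vector.Element> it = y.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      dot += e.get() * x.getQuick(e.index());
    }
    return x.getLengthSquared() + y.getLengthSquared() - 2.0 * dot;
  }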
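
P.P.S. To make the aggregate() idea concrete, here's a toy mock of the
proposed semantics, with double[]s standing in for Vectors (the function
impls and names below are mine, purely for illustration, not actual
MAHOUT-209 code):

  public class AggregateSketch {
    interface BinaryFunction {
      double apply(double a, double b);
    }

    // The proposed aggregate(double initialValue, Vector v,
    // BinaryFunction agg, BinaryFunction c), mocked over arrays.
    static double aggregate(double initial, double[] x, double[] y,
                            BinaryFunction agg, BinaryFunction combiner) {
      double result = initial;
      for (int i = 0; i < x.length; i++) {
        // agg computes the per-element contribution; combiner (plus, for
        // every DistanceMeasure I have in mind) folds it into the total.
        result = combiner.apply(result, agg.apply(x[i], y[i]));
      }
      return result;
    }

    public static void main(String[] args) {
      double[] x = {1.0, 0.0, 3.0};
      double[] y = {0.0, 2.0, 1.0};
      BinaryFunction plus = new BinaryFunction() {
        public double apply(double a, double b) { return a + b; }
      };
      BinaryFunction absDiff = new BinaryFunction() {
        public double apply(double a, double b) { return Math.abs(a - b); }
      };
      // Manhattan distance: agg = |a - b|, combiner = plus, initial = 0.
      System.out.println(aggregate(0.0, x, y, absDiff, plus)); // prints 5.0
    }
  }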