On Wed, Jan 27, 2010 at 11:35 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> That does look wrong.
>

Well, that's problematic, because that code is used throughout all of our
clusterers (KMeans and Canopy, for starters).


> The origin of the optimization is that (x-y)^2 = x^2 + y^2 - 2 x y
> = x^2 + y (y - 2x).  If x^2 is pre-computed and y is sparse, then this
> second form can be much faster than the first.


This part is totally true, but does not appear to be what is implemented.


>  Whether the third form is better still
> depends on implementation, but if it is done without an extra vector
> allocation, then it should be better because it only iterates through y
> once.
>

x^2 + y.(y - 2x)

Hmm... how exactly do you do both a dot product and a subtraction in one
iteration?  I guess if x is dense, something like:

// result starts at x.getLengthSquared(), i.e. the pre-computed x^2
for (Vector.Element e : y)
  result += e.get() * (e.get() - 2 * x.getQuick(e.index()));

Ok.  Of course, if y is caching its lengthSquared already, you still only
need to do the iteration once, and you don't need to do a subtraction inside
the inner loop at all.
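
Something like this, assuming both impls cache getLengthSquared() and that
we walk y's nonzeros with iterateNonZero() (a sketch, not tested):

  double result = x.getLengthSquared() + y.getLengthSquared();
  Iterator<Vector.Element> it = y.iterateNonZero();
  while (it.hasNext()) {
    Vector.Element e = it.next();
    result -= 2 * e.get() * x.getQuick(e.index());  // the -2 x.y term
  }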

Of course, just calling vector.getDistanceSquared(other) allows the vector
impl to take care of these optimizations itself.  Unfortunately, this only
helps for SquaredEuclideanDistanceMeasure.  (On the other hand, the whole
idea of optimizing via (x-y)^2 = x^2 + y^2 - 2x.y depends completely
on the distance measure being defined in terms of dot products, and
doesn't apply to the general case [Manhattan, Tanimoto, Cosine].)
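
For the squared Euclidean case, that just means the measure delegates to the
vector, roughly (a sketch of the delegation, not necessarily what's in trunk):

  public double distance(Vector v1, Vector v2) {
    // let the Vector implementation pick the cheapest strategy
    return v1.getDistanceSquared(v2);
  }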

It seems the better way to do this would be to call MAHOUT-209 methods
like aggregate(Vector v, BinaryFunction aggregator, BinaryFunction combiner),
with the combiner being plus and the aggregator impls pulled out of the
DistanceMeasure versions.  Of course, now that I think of it, MAHOUT-209
would need to be extended to allow a version like:

  aggregate(double initialValue, Vector v, BinaryFunction agg, BinaryFunction c);

which allows you to set the initial value to something other than zero.
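
The squared Euclidean case could then look something like this (purely
hypothetical, since the initialValue overload doesn't exist yet; I'm assuming
the MAHOUT-209 BinaryFunction interface and a plus combiner along the lines
of Functions.PLUS):

  // seed the aggregation with the pre-computed x^2, then fold in
  // y_i * (y_i - 2 x_i) for each nonzero element of y
  double distSq = y.aggregate(x.getLengthSquared(), x,
      new BinaryFunction() {
        public double apply(double yi, double xi) {
          return yi * (yi - 2 * xi);
        }
      },
      Functions.PLUS);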

  -jake
