On Thu, Jan 28, 2010 at 3:26 AM, Shashikant Kore <shashik...@gmail.com>wrote:

> Jake,
>
> The distance optimization was done in MAHOUT-121.
> http://issues.apache.org/jira/browse/MAHOUT-121


Thanks for pointing that out.  I think I followed the math; the issue was
that in all the back-and-forth in MAHOUT-121 about where to declare
variables, the actual formula for the optimized distance was not carefully
checked (the optimized method in SquaredEuclideanDistanceMeasure was not
actually returning the squared euclidean distance!)

On top of that, we are now caching lengthSquared of vectors whenever
possible, and vectors have getDistanceSquared(Vector) as well (not sure if
that existed in the timeframe of MAHOUT-121), which itself can be optimized
for particular vector implementations... that part isn't done currently in
trunk (I had it done wrong in one sparse impl.)
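For concreteness, here's a rough sketch (not the actual Mahout code; the array-based helpers and class name are made up for illustration) of the algebraic identity the optimized method should implement: ||c - v||^2 == ||c||^2 - 2*(c . v) + ||v||^2, which lets you reuse a cached centroid lengthSquared instead of iterating over the full difference vector:

```java
// Hypothetical sketch of the optimized squared euclidean distance.
// Assumes dense double[] vectors for simplicity; in Mahout these would be
// Vector instances, and the dot product could skip zeros of a sparse v.
public class SquaredDistanceSketch {

  static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }

  static double lengthSquared(double[] v) {
    return dot(v, v);
  }

  // Naive form: iterate over the componentwise difference.
  static double distanceSquared(double[] v1, double[] v2) {
    double s = 0.0;
    for (int i = 0; i < v1.length; i++) {
      double d = v1[i] - v2[i];
      s += d * d;
    }
    return s;
  }

  // Optimized form: reuses a precomputed ||centroid||^2, so only the
  // dot product and ||v||^2 need to be computed per point.
  static double distanceSquared(double centroidLengthSquare,
                                double[] centroid, double[] v) {
    return centroidLengthSquare - 2.0 * dot(centroid, v) + lengthSquared(v);
  }

  public static void main(String[] args) {
    double[] c = {1.0, 2.0, 0.0};
    double[] v = {0.0, 1.0, 3.0};
    double naive = distanceSquared(c, v);
    double fast = distanceSquared(lengthSquared(c), c, v);
    // Both forms must agree: (1)^2 + (1)^2 + (-3)^2 = 11
    if (Math.abs(naive - fast) > 1e-12) {
      throw new AssertionError(naive + " != " + fast);
    }
    System.out.println(naive); // prints 11.0
  }
}
```

The bug class to watch for is exactly a sign or missing term in that three-term expansion; the two overloads must satisfy distance(v1, v2) == distance(v1.getLengthSquared(), v1, v2) for every pair.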

If you want to run your clustering set again and compare the timing
against what it was before, I'd like to hear what difference you see!

  -jake


> The idea is described neatly on LingPipe blog
>
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
>
> I will go through the conversation between you and Ted, and chip in
> wherever needed.
>
> --shashi
>
> On Wed, Jan 27, 2010 at 11:42 PM, Jake Mannix <jake.man...@gmail.com>
> wrote:
> > The interface defines two methods:
> >
> >
> >  double distance(Vector v1, Vector v2);
> >  double distance(double centroidLengthSquare, Vector centroid, Vector v);
> >
> >
> > With the latter being an optimized form of the former, and satisfies:
> >
> >  distance(v1, v2) == distance(v1.getLengthSquared(), v1, v2)
> >
> > Is this correct?  Every place I see this method called, it is used in
> this
> > fashion, at least...
> >
> >  -jake
> >
>