So now, it RandomAccessSparseVector seems to be the most affected.
Pretty much every regression is related to RASV. Could it be that it's
better to handle it as a non-constant time update Vector and have drop the
in-place updates?
Otherwise, the code that implements Minus is pretty much the same as trunk:
Iterator<Vector.Element> yi = y.iterateNonZero();
Vector.Element ye;
while (yi.hasNext()) {
ye = yi.next();
x.setQuick(ye.index(), f.apply(x.getQuick(ye.index()), ye.get()));
}
return x;
The main difference is that it's not going through incrementQuick which for
RASVs skips a few calls.
On Tue, Apr 30, 2013 at 5:45 AM, Robin Anil <[email protected]> wrote:
> After this change the benchmark actually takes about 41 minutes :)
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 29, 2013 at 9:45 PM, Robin Anil <[email protected]> wrote:
>
> > After increasing the lead time to 15 seconds, I believe I am giving
> enough
> > time for JIT to take place. This way we are measuring only the JITed code
> > not the interpreted code. There is a flag in JVM with which you can
> change
> > the threshold -XX:+CompileThreshold (default: 10000). I didnt want to
> mess
> > with it to make it as close as possible to the real world.
> >
> > I have also disabled the cluster benchmarks as they are not executed
> > enough times for JIT to compile them. I have to look more at them. for
> now
> > ignore them for your patch.
> >
> > But you can see that variance is greatly reduced (See Sheet Version8)
> >
> >
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=8
> >
> > There are only a handful of regressions Norm1 (for Seq and Rand), Minus
> > and Plus (for Rand) and a few others.
> >
> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >
> >
> > On Mon, Apr 29, 2013 at 11:37 AM, Robin Anil <[email protected]>
> wrote:
> >
> >> I have a few ideas to tweak around with the JIT to make the benchmarks
> >> more stable. I will ping back once I get some time to do something.
> >>
> >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>
> >>
> >> On Mon, Apr 29, 2013 at 10:32 AM, Dan Filimon <
> >> [email protected]> wrote:
> >>
> >>> That's the weirdness, it's _exactly_ the same code.
> >>>
> >>>
> >>> On Mon, Apr 29, 2013 at 6:31 PM, Robin Anil <[email protected]>
> >>> wrote:
> >>>
> >>> > I will pull your Patch in and take a look a close look at it tonight.
> >>> Whats
> >>> > different in Run3 in Version7 v/s Run2 and Run1. Because Run3 looks
> >>> really
> >>> > good and the regressions of 30% in some are actually not as bad and
> >>> nothing
> >>> > no longer looks ridiculously blocking the patch.
> >>> >
> >>> >
> >>> > On Mon, Apr 29, 2013 at 10:13 AM, Dan Filimon
> >>> > <[email protected]>wrote:
> >>> >
> >>> > > I uploaded the latest patch to my vector branch on github [0].
> >>> > >
> >>> > > The latest revision is, as always, here [1], "Version 7
> >>> > (Assign/Aggregate;
> >>> > > 1 sec runtime)". Tests pass for this one except for some aggregate
> >>> tests
> >>> > > (it's not clear if aggregate should throw an exception when the
> >>> vectors
> >>> > are
> >>> > > empty or return 0; I chose return 0).
> >>> > >
> >>> > > There's quite a bit of variance, a lot of it even in code that I
> >>> haven't
> >>> > > changed between runs. So, norm2 shows regressions in version 7 and
> >>> > > improvements in version 6 and the code not having been changed.
> >>> > >
> >>> > > The dot product is also oscillating in version 7, showing different
> >>> > results
> >>> > > in the different runs (of the same algorithm).
> >>> > >
> >>> > > Except for some cleanup, I don't think I can do much better than
> >>> this, so
> >>> > > could someone please also run the benchmarks and I'll update the
> >>> patch on
> >>> > > ReviewBoard?
> >>> > >
> >>> > > [0] https://github.com/dfilimon/mahout/tree/vector
> >>> > > [1]
> >>> > >
> >>> > >
> >>> >
> >>>
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=7
> >>> > >
> >>> > >
> >>> > > On Sun, Apr 28, 2013 at 11:18 PM, Ted Dunning <
> [email protected]
> >>> >
> >>> > > wrote:
> >>> > >
> >>> > > > It would *really* help to put a legend on these spreadsheets even
> >>> if
> >>> > you
> >>> > > > have described it in the past.
> >>> > > >
> >>> > > > I take it that these are number of ops and bigger is better,
> right?
> >>> > > >
> >>> > > > And where is a description of what patch runs 1 and 2 actually
> are?
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > On Sun, Apr 28, 2013 at 2:23 AM, Dan Filimon <
> >>> > > [email protected]>wrote:
> >>> > > >
> >>> > > >> Here's [1] the most recent benchmarks, after refactoring the
> >>> Assign
> >>> > into
> >>> > > >> separate classes.
> >>> > > >> It's now using a 10 second runtime and a 1 second lead time and
> >>> these
> >>> > > are
> >>> > > >> only the benchmarks involving assign().
> >>> > > >>
> >>> > > >> Oddly enough there are regressions with
> RandomAccessSparseVectors'
> >>> > > plus()
> >>> > > >> and minus() despite these working like trunk from what I can
> tell
> >>> from
> >>> > > the
> >>> > > >> tests (VectorBinaryAssignTest).
> >>> > > >>
> >>> > > >> Also, there's quite a bit of variance for some reason... Clone,
> >>> Create
> >>> > > >> are just two of the odd ones.
> >>> > > >>
> >>> > > >> [1]
> >>> > > >>
> >>> > >
> >>> >
> >>>
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4
> >>> > > >>
> >>> > > >>
> >>> > > >> On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil <
> [email protected]
> >>> >
> >>> > > wrote:
> >>> > > >>
> >>> > > >>> Do in place for rand an dense. Assign a bug to me to speed up
> >>> rasv
> >>> > > >>> GetElement.
> >>> > > >>> On Apr 25, 2013 2:56 PM, "Dan Filimon" <
> >>> [email protected]>
> >>> > > >>> wrote:
> >>> > > >>>
> >>> > > >>> > Nearly done splitting the code up, but I'm not sure what the
> >>> costs
> >>> > > >>> should
> >>> > > >>> > ideally be.
> >>> > > >>> >
> >>> > > >>> > Robin, you proposed: cost of iteration + cost of lookup +
> cost
> >>> of
> >>> > > >>> update
> >>> > > >>> > (if its not in-place)
> >>> > > >>> >
> >>> > > >>> > This sounds like it's per element, rather than for the entire
> >>> > vector.
> >>> > > >>> Also,
> >>> > > >>> > are we just going to assume that there will be as many
> updates
> >>> as
> >>> > > >>> non-zeros
> >>> > > >>> > or as large as the array is? How does the cost of an update
> >>> factor
> >>> > > in?
> >>> > > >>> >
> >>> > > >>> > And, is "in-place" just for DenseVectors?
> >>> > > >>> >
> >>> > > >>> >
> >>> > > >>> > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil <
> >>> [email protected]>
> >>> > > >>> wrote:
> >>> > > >>> >
> >>> > > >>> > > Depends on the speed of copying a 1M double array v/s
> doing a
> >>> > > calloc
> >>> > > >>> +
> >>> > > >>> > > copying 1000 non zeros (Assuming java is doing that
> >>> underneath).
> >>> > > >>> > >
> >>> > > >>> > > ------
> >>> > > >>> > > Robin Anil
> >>> > > >>> > >
> >>> > > >>> > >
> >>> > > >>> > >
> >>> > > >>> > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <
> >>> > > >>> > [email protected]
> >>> > > >>> > > >wrote:
> >>> > > >>> > >
> >>> > > >>> > > > Right, but is clone() generally slower than assigning?
> That
> >>> > > >>> strikes me
> >>> > > >>> > as
> >>> > > >>> > > > odd; doesn't Java optimize copying the internal
> structures
> >>> > (there
> >>> > > >>> are
> >>> > > >>> > > > arrays underneath after all)?
> >>> > > >>> > > >
> >>> > > >>> > > >
> >>> > > >>> > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <
> >>> > > [email protected]
> >>> > > >>> >
> >>> > > >>> > > wrote:
> >>> > > >>> > > >
> >>> > > >>> > > >> Seems like for dense clone is slower than like().assign
> I
> >>> need
> >>> > > to
> >>> > > >>> test
> >>> > > >>> > > it
> >>> > > >>> > > >> with different sizes to be sure. I kept it from the old
> >>> > > behavior.
> >>> > > >>> > > >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <
> >>> > > >>> [email protected]>
> >>> > > >>> > > >> wrote:
> >>> > > >>> > > >>
> >>> > > >>> > > >>> Okay, so I should split it further into smaller
> sub-cases
> >>> > that
> >>> > > >>> handle
> >>> > > >>> > > >>> each Vector type. I tried making so that these match
> the
> >>> > cases
> >>> > > >>> in the
> >>> > > >>> > > >>> document to the extent possible.
> >>> > > >>> > > >>> You're right. It is ugly and I need to split it up.
> >>> > > >>> > > >>>
> >>> > > >>> > > >>> I removed the OrderedIntDoubleMapping (but with another
> >>> > if...)
> >>> > > >>> but
> >>> > > >>> > one
> >>> > > >>> > > >>> more thing that helped is making the assign() only go
> >>> through
> >>> > > the
> >>> > > >>> > > non-zeros
> >>> > > >>> > > >>> when setting up a new copy.
> >>> > > >>> > > >>>
> >>> > > >>> > > >>> About the copying, I saw that the optimized copy
> doesn't
> >>> > always
> >>> > > >>> call
> >>> > > >>> > > >>> clone clone on the Vector. Why is this?
> >>> > > >>> > > >>>
> >>> > > >>> > > >>> Thanks for the test and the feedback! :)
> >>> > > >>> > > >>>
> >>> > > >>> > > >>>
> >>> > > >>> > > >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <
> >>> > > >>> [email protected]
> >>> > > >>> > > >wrote:
> >>> > > >>> > > >>>
> >>> > > >>> > > >>>> Yes its reaching dev. I checked out your code to take
> a
> >>> > look.
> >>> > > I
> >>> > > >>> > > >>>> understand you are trying to still mix
> >>> isSequentialAccess
> >>> > etc
> >>> > > >>> in the
> >>> > > >>> > > >>>> validity calculation. The whole point of this
> >>> architecture
> >>> > is
> >>> > > to
> >>> > > >>> > > completely
> >>> > > >>> > > >>>> unroll things to be more modular instead of creating a
> >>> giant
> >>> > > >>> > spaghetti
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> Here is an example of what I am trying to say.
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> This is your current implementation of
> >>> > > >>> AssignIterateOneLookupOther.
> >>> > > >>> > > >>>> There are so many conditionals
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> 1. @Override
> >>> > > >>> > > >>>> 2. public Vector assign(Vector x, Vector y,
> >>> > > >>> > > >>>> DoubleDoubleFunction f) {
> >>> > > >>> > > >>>> 3. return !swap ? assignInner(x, y, f) :
> >>> > > >>> assignInner(y, x,
> >>> > > >>> > > f);
> >>> > > >>> > > >>>> 4. }
> >>> > > >>> > > >>>> 5.
> >>> > > >>> > > >>>> 6. public Vector assignInner(Vector x, Vector
> y,
> >>> > > >>> > > >>>> DoubleDoubleFunction f) {
> >>> > > >>> > > >>>> 7. Iterator<Vector.Element> xi =
> >>> > x.iterateNonZero();
> >>> > > >>> > > >>>> 8. Vector.Element xe;
> >>> > > >>> > > >>>> 9. Vector.Element ye;
> >>> > > >>> > > >>>> 10. OrderedIntDoubleMapping updates =
> >>> > > >>> > > newOrderedIntDoubleMapping();
> >>> > > >>> > > >>>> 11. while (xi.hasNext()) {
> >>> > > >>> > > >>>> 12. xe = xi.next();
> >>> > > >>> > > >>>> 13. ye = y.getElement(xe.index());
> >>> > > >>> > > >>>> 14. if (!swap) {
> >>> > > >>> > > >>>> 15. xe.set(f.apply(xe.get(), ye.get()));
> >>> > > >>> > > >>>> 16. } else {
> >>> > > >>> > > >>>> 17. if (ye.get() != 0.0 ||
> >>> > > y.isAddConstantTime()) {
> >>> > > >>> > > >>>> 18. ye.set(f.apply(ye.get(),
> xe.get()));
> >>> > > >>> > > >>>> 19. } else {
> >>> > > >>> > > >>>> 20. updates.set(xe.index(),
> >>> f.apply(ye.get(),
> >>> > > >>> > > >>>> xe.get()));
> >>> > > >>> > > >>>> 21. }
> >>> > > >>> > > >>>> 22. }
> >>> > > >>> > > >>>> 23. }
> >>> > > >>> > > >>>> 24. if (swap && !y.isAddConstantTime()) {
> >>> > > >>> > > >>>> 25. y.mergeUpdates(updates);
> >>> > > >>> > > >>>> 26. }
> >>> > > >>> > > >>>> 27. return swap ? y : x;
> >>> > > >>> > > >>>> 28. }
> >>> > > >>> > > >>>> 29. }
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> Split this into.
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> IterateThisLookupThatInplaceUpdate
> >>> > > >>> > > >>>> IterateThatLookupThisInplaceUpdate
> >>> > > >>> > > >>>> IterateThisLookupThatMergeUpdate
> >>> > > >>> > > >>>> IterateThatLookupThisMergeUpdate
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> Removing the conditionals itself will speedup the
> >>> operations
> >>> > > >>> you see
> >>> > > >>> > > >>>> regression on.
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> Also note that getElement() is not constant Add time
> for
> >>> > RASV.
> >>> > > >>> In
> >>> > > >>> > > >>>> earlier version we tried at best to set in place. That
> >>> > reduced
> >>> > > >>> the
> >>> > > >>> > > extra
> >>> > > >>> > > >>>> hashkey computation. getElement in AbstractVector is
> >>> really
> >>> > > >>> > optimized
> >>> > > >>> > > for
> >>> > > >>> > > >>>> DenseVector and it does indicate as the cause of
> >>> regression
> >>> > > you
> >>> > > >>> are
> >>> > > >>> > > seeing
> >>> > > >>> > > >>>> in RASV. Another one is the unused new
> >>> > > >>> OrderedIntDoubleMapping().
> >>> > > >>> > > Removing
> >>> > > >>> > > >>>> all these will bring back the loss.
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> Now the cost for each can be computed as: cost of
> >>> iteration
> >>> > +
> >>> > > >>> cost
> >>> > > >>> > of
> >>> > > >>> > > >>>> lookup + cost of update (if its not in-place)
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> See the attached file for a mock test. You will have
> to
> >>> > rework
> >>> > > >>> the
> >>> > > >>> > > >>>> expectations but the framework should be
> understandable.
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> ------
> >>> > > >>> > > >>>> Robin Anil
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon <
> >>> > > >>> > > >>>> [email protected]> wrote:
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>>> Here's [1] a link to the "design document" of the new
> >>> > vector
> >>> > > >>> > > >>>>> operations so I can lay out the ideas behind what I'm
> >>> doing
> >>> > > >>> more
> >>> > > >>> > > clearly.
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>> I'd like anyone who can to have a look and comment.
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>> Ted,
> >>> > > >>> > > >>>>> I know you'd like me to get back to working on
> >>> clustering,
> >>> > > but
> >>> > > >>> > > >>>>> currently, BallKMeans is roughly 2x as slow as
> Mahout's
> >>> > > KMeans.
> >>> > > >>> > > >>>>> This is because of centroid.update() for one (it
> >>> doesn't
> >>> > have
> >>> > > >>> > > >>>>> "interesting" properties that are exploited in the
> >>> current)
> >>> > > and
> >>> > > >>> > also
> >>> > > >>> > > >>>>> because we're using we're just doing lots of Vector
> >>> > > >>> operations: dot
> >>> > > >>> > > >>>>> products for projections, hash codes for unique sets
> of
> >>> > > >>> candidates,
> >>> > > >>> > > >>>>> equality checks etc.
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>> Here's a snapshot of the hotspots for one run of
> >>> BallKMeans
> >>> > > (20
> >>> > > >>> > > >>>>> newsgroups, unprojected, i.e 20K sparse vectors with
> >>> 200K
> >>> > > >>> > dimensions
> >>> > > >>> > > and
> >>> > > >>> > > >>>>> ~100 nonzero values):
> >>> > > >>> > > >>>>> Name Time (ms)
> >>> > > >>> org.apache.mahout.math.AbstractVector.dot(Vector)
> >>> > > >>> > > >>>>> 104069
> >>> org.apache.mahout.math.DelegatingVector.hashCode()
> >>> > > >>> 30159
> >>> > > >>> > >
> >>> > org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator,
> >>> > > >>> > > >>>>> Iterator, DoubleDoubleFunction) 13720
> >>> > > >>> > > >>>>>
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping)
> >>> > > >>> > > >>>>> 2911 java.lang.ClassLoader.loadClass(String,
> boolean)
> >>> 2620
> >>> > > >>> > > sun.misc.Launcher$AppClassLoader.loadClass(String,
> >>> > > >>> > > >>>>> boolean) 2620
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>> You can see that the most time is spent doing Vector
> >>> > > >>> operations.
> >>> > > >>> > > >>>>> Combined, the first three rows account for 91% of the
> >>> > > runtime.
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>> Thanks for your help!
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>> PS: Is this reaching dev@?
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>> [1]
> >>> > > >>> > > >>>>>
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <
> >>> > > >>> [email protected]
> >>> > > >>> > > >wrote:
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>>> Few more brain dump ideas
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>> For any operation (assign, aggregate) you are trying
> >>> to
> >>> > do.
> >>> > > >>> > > >>>>>> Compute set of "feasible" solutions: Based on all
> the
> >>> > > >>> properties
> >>> > > >>> > of
> >>> > > >>> > > >>>>>> functions
> >>> > > >>> > > >>>>>> Choose the solution that minimises the cost.
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>> For this you may need to refactor all the special
> >>> kinds of
> >>> > > >>> > > iterations
> >>> > > >>> > > >>>>>> and aggregations subclassing from AssignOperation
> and
> >>> > > >>> > > AggregateOperation.
> >>> > > >>> > > >>>>>> Each operation would return the cost given two input
> >>> > vectors
> >>> > > >>> > (based
> >>> > > >>> > > on
> >>> > > >>> > > >>>>>> their cost properties)
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>> ------
> >>> > > >>> > > >>>>>> Robin Anil
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <
> >>> > > >>> [email protected]
> >>> > > >>> > > >wrote:
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>>> Strike that. I think the code is very confusing.
> You
> >>> will
> >>> > > >>> have
> >>> > > >>> > > >>>>>>> better luck if you create method for the three
> >>> > > >>> implementations.
> >>> > > >>> > It
> >>> > > >>> > > will
> >>> > > >>> > > >>>>>>> make things explicit and easy to debug.
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>> public final double getIterationCost() {
> >>> > > >>> > > >>>>>>> return one of {cardinality, 1.2*numNonDefault,
> >>> > > >>> numNonDefault}
> >>> > > >>> > > >>>>>>> }
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>> public final double getLookupCost() {
> >>> > > >>> > > >>>>>>> return one of {1, 1.5(lets say),
> >>> log_2(numNonDefault)}
> >>> > > >>> > > >>>>>>> }
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>> All your logic then becomes a minimization of the
> >>> total
> >>> > > cost.
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>> This is more extendable than current isDense(),
> >>> > > >>> isSequential(),
> >>> > > >>> > > >>>>>>> isSparse(), isRandom() and doesn't assume what the
> >>> > > underlying
> >>> > > >>> > > >>>>>>> implementation is.
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>> ------
> >>> > > >>> > > >>>>>>> Robin Anil
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <
> >>> > > >>> > [email protected]
> >>> > > >>> > > >wrote:
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>>> Also the calculation for cost is wrong. RASV is
> >>> HashMap
> >>> > > >>> based
> >>> > > >>> > its
> >>> > > >>> > > >>>>>>>> cost O(n) not O(log n)
> >>> > > >>> > > >>>>>>>>
> >>> > > >>> > > >>>>>>>> ------
> >>> > > >>> > > >>>>>>>> Robin Anil
> >>> > > >>> > > >>>>>>>>
> >>> > > >>> > > >>>>>>>>
> >>> > > >>> > > >>>>>>>>
> >>> > > >>> > > >>>>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <
> >>> > > >>> > [email protected]
> >>> > > >>> > > >wrote:
> >>> > > >>> > > >>>>>>>>
> >>> > > >>> > > >>>>>>>>> This is an automatically generated e-mail. To
> >>> reply,
> >>> > > >>> visit:
> >>> > > >>> > > >>>>>>>>> https://reviews.apache.org/r/10669/
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> >
> math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> >>> > > >>> > >
> >>> > > >>>
> >>> > >
> >>>
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56>
> >>> > > >>> > > (Diff
> >>> > > >>> > > >>>>>>>>> revision 2)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> None
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> 55
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> boolean commutativeAndAssociative =
> >>> > > >>> > > aggregator.isCommutative() && aggregator.isAssociative();
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> This should be a method in AbstractFunction
> >>> interface
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> >
> math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037
> >>> > > >>> >
> >>> > > >>> > > (Diff
> >>> > > >>> > > >>>>>>>>> revision 2)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
> >>> > 'line':
> >>> > > >>> 726,
> >>> > > >>> > > 'expand_offset': 2}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
> >>> > 'line':
> >>> > > >>> 927,
> >>> > > >>> > > 'expand_offset': 2}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> 942
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> // Make all the non-zero values 0.
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> vector.clear()
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> >
> math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047
> >>> > > >>> >
> >>> > > >>> > > (Diff
> >>> > > >>> > > >>>>>>>>> revision 2)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
> >>> > 'line':
> >>> > > >>> 726,
> >>> > > >>> > > 'expand_offset': 2}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
> >>> > 'line':
> >>> > > >>> 927,
> >>> > > >>> > > 'expand_offset': 2}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> 952
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> while (it.hasNext()) {
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> There is no guarantee of ordered iteration. The
> >>> > updates
> >>> > > >>> is
> >>> > > >>> > > unsorted. merge typically implies that input is sorted
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> >
> math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074
> >>> > > >>> >
> >>> > > >>> > > (Diff
> >>> > > >>> > > >>>>>>>>> revision 2)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
> >>> > 'line':
> >>> > > >>> 726,
> >>> > > >>> > > 'expand_offset': 2}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
> >>> > 'line':
> >>> > > >>> 927,
> >>> > > >>> > > 'expand_offset': 2}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> 979
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> element.set(values[index]);
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> What if there are values that are not in the
> >>> existing
> >>> > > >>> data. I
> >>> > > >>> > > dont this is correct
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> >
> math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100
> >>> > > >>> >
> >>> > > >>> > > (Diff
> >>> > > >>> > > >>>>>>>>> revision 2)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
> >>> > 'line':
> >>> > > >>> 726,
> >>> > > >>> > > 'expand_offset': 2}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
> >>> > 'line':
> >>> > > >>> 927,
> >>> > > >>> > > 'expand_offset': 2}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> 999
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> invalidateCachedLength();
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> move invalidateCachedLength to the set method.
> >>> > Instead
> >>> > > of
> >>> > > >>> > > dispersing it everywhere.
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> >
> math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115
> >>> > > >>> >
> >>> > > >>> > > (Diff
> >>> > > >>> > > >>>>>>>>> revision 2)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> None
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': ' private Vector
> >>> > > >>> assignIterateBoth(Iterator<Element>
> >>> > > >>> > > thisIterator, Iterator<Element> thatIterator,', 'line':
> 1070}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> 1014
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> protected Vector
> >>> > > >>> assignSkipZerosIterateOne(Iterator<Element>
> >>> > > >>> > > thisIterator, Vector other,
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> be consistent, this that or thisIterator that
> >>> > Iterator.
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> >
> math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171
> >>> > > >>> >
> >>> > > >>> > > (Diff
> >>> > > >>> > > >>>>>>>>> revision 2)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> None
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': ' private Vector
> >>> > > >>> assignIterateBoth(Iterator<Element>
> >>> > > >>> > > thisIterator, Iterator<Element> thatIterator,', 'line':
> 1070}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> 1070
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> private Vector
> >>> assignIterateBoth(Iterator<Element>
> >>> > > >>> > > thisIterator, Iterator<Element> thatIterator,
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> This expects the data to to be sorted. indicate
> >>> that
> >>> > > in a
> >>> > > >>> > > comment. Similarly comment other places
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > >
> >>> > math/src/main/java/org/apache/mahout/math/function/Functions.java<
> >>> > > >>> > >
> >>> > > >>>
> >>> > >
> >>>
> https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900
> >>> > > >>> > >
> >>> > > >>> > > (Diff
> >>> > > >>> > > >>>>>>>>> revision 2)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> None
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> {'text': ' public boolean isLikeLeftMult()
> {',
> >>> > > 'line':
> >>> > > >>> > 892}
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> 896
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> /**
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> Move this comments to the base class
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> - Robin
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon
> wrote:
> >>> > > >>> > > >>>>>>>>> Review request for mahout, Ted Dunning,
> Sebastian
> >>> > > >>> Schelter,
> >>> > > >>> > and
> >>> > > >>> > > >>>>>>>>> Robin Anil.
> >>> > > >>> > > >>>>>>>>> By Dan Filimon.
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> *Updated April 22, 2013, 8:45 p.m.*
> >>> > > >>> > > >>>>>>>>> Description
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> This patch contains code cleaning up
> >>> AbstractVector and
> >>> > > >>> making
> >>> > > >>> > > the operations as fast as possible while still having a
> high
> >>> > level
> >>> > > >>> > > interface.
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> The main changes are in AbstractVector as well as
> >>> new
> >>> > > >>> methods
> >>> > > >>> > in
> >>> > > >>> > > DoubleDoubleFunction.
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> Testing
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> The vectors test pass but it's likely that the
> >>> patch in
> >>> > > >>> it's
> >>> > > >>> > > current state is broken as other, unrelated tests
> >>> (BallKMeans...)
> >>> > > are
> >>> > > >>> > > failing.
> >>> > > >>> > > >>>>>>>>> Also, my Hadoop conf is broken so I didn't run
> all
> >>> the
> >>> > > core
> >>> > > >>> > > tests. Anyone?
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> I can't seem to find the bug, so _please_ have a
> >>> closer
> >>> > > >>> look.
> >>> > > >>> > > It's still a work in progress.
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> The benchmarks seem comparable (although there
> are
> >>> some
> >>> > > >>> jarring
> >>> > > >>> > > diferences – Minkowski distance seems a lot slower in
> >>> new-dan-1
> >>> > > than
> >>> > > >>> > > old-trunk-2). It may be however that this is just variance
> >>> due to
> >>> > > the
> >>> > > >>> > load
> >>> > > >>> > > of the machine at the time. I'm having trouble interpreting
> >>> the
> >>> > > >>> > benchmarks
> >>> > > >>> > > in general, so anyone who could give me a hand is more than
> >>> > > welcome.
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> Diffs
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
> >>> > > >>> > > >>>>>>>>> (e12aa38)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> math/src/main/java/org/apache/mahout/math/AbstractVector.java
> >>> > > >>> > > >>>>>>>>> (090aa7a)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> math/src/main/java/org/apache/mahout/math/Centroid.java
> >>> > > >>> > > >>>>>>>>> (0c42196)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> math/src/main/java/org/apache/mahout/math/ConstantVector.java
> >>> > > >>> > > >>>>>>>>> (51d67d4)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> math/src/main/java/org/apache/mahout/math/DelegatingVector.java
> >>> > > >>> > > >>>>>>>>> (12220d4)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> math/src/main/java/org/apache/mahout/math/DenseVector.java
> >>> > > >>> > > >>>>>>>>> (41c356b)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
> >>> > > >>> > > >>>>>>>>> (094003b)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> math/src/main/java/org/apache/mahout/math/MatrixSlice.java
> >>> > > >>> > > >>>>>>>>> (7f79c96)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
> >>> > > >>> > > >>>>>>>>> (af70727)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> math/src/main/java/org/apache/mahout/math/NamedVector.java
> >>> > > >>> > > >>>>>>>>> (4b7a41d)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> > > >>>
> >>> >
> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
> >>> > > >>> > > >>>>>>>>> (650d82d)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
> >>> > > >>> > > >>>>>>>>> (d1ea93a)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> > > >>>
> >>> >
> math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
> >>> > > >>> > > >>>>>>>>> (6f85692)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
> >>> > > >>> > > >>>>>>>>> (21982f9)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > math/src/main/java/org/apache/mahout/math/Vector.java
> >>> > > >>> > > >>>>>>>>> (2f8b417)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> math/src/main/java/org/apache/mahout/math/VectorView.java
> >>> > > >>> > > >>>>>>>>> (add2a60)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> math/src/main/java/org/apache/mahout/math/WeightedVector.java
> >>> > > >>> > > >>>>>>>>> (06fbd05)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > >
> >>> >
> >>>
> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
> >>> > > >>> > > >>>>>>>>> (82b350a)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> math/src/main/java/org/apache/mahout/math/function/Functions.java
> >>> > > >>> > > >>>>>>>>> (eb2e42f)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> math/src/main/java/org/apache/mahout/math/function/PlusMult.java
> >>> > > >>> > > >>>>>>>>> (60587b1)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> > >
> math/src/main/java/org/apache/mahout/math/function/TimesFunction.java
> >>> > > >>> > > >>>>>>>>> (8ab0bb1)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> math/src/main/java/org/apache/mahout/math/jet/math/Constants.java
> >>> > > >>> > > >>>>>>>>> (53535d6)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> > >
> >>> math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java
> >>> > > >>> > > >>>>>>>>> (2b11199)
> >>> > > >>> > > >>>>>>>>> -
> >>> > > >>> math/src/test/java/org/apache/mahout/math/VectorTest.java
> >>> > > >>> > > >>>>>>>>> (d6d554b)
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>> View Diff <
> >>> https://reviews.apache.org/r/10669/diff/>
> >>> > > >>> > > >>>>>>>>>
> >>> > > >>> > > >>>>>>>>
> >>> > > >>> > > >>>>>>>>
> >>> > > >>> > > >>>>>>>
> >>> > > >>> > > >>>>>>
> >>> > > >>> > > >>>>>
> >>> > > >>> > > >>>>
> >>> > > >>> > > >>>
> >>> > > >>> > > >
> >>> > > >>> > >
> >>> > > >>> >
> >>> > > >>>
> >>> > > >>
> >>> > > >>
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>