Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

Robin Anil Mon, 29 Apr 2013 19:46:30 -0700

After this change the benchmark actually takes about 41 minutes :)

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.



On Mon, Apr 29, 2013 at 9:45 PM, Robin Anil <[email protected]> wrote:

> After increasing the lead time to 15 seconds, I believe I am giving enough
> time for JIT to take place. This way we are measuring only the JITed code
> not the interpreted code. There is a flag in JVM with which you can change
> the threshold -XX:+CompileThreshold (default: 10000). I didnt want to mess
> with it to make it as close as possible to the real world.
>
> I have also disabled the cluster benchmarks as they are not executed
> enough times for JIT to compile them. I have to look more at them. for now
> ignore them for your patch.
>
> But you can see that variance is greatly reduced (See Sheet Version8)
>
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=8
>
> There are only a handful of regressions Norm1 (for Seq and Rand), Minus
> and Plus (for Rand) and a few others.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 29, 2013 at 11:37 AM, Robin Anil <[email protected]> wrote:
>
>> I have a few ideas to tweak around with the JIT to make the benchmarks
>> more stable. I will ping back once I get some time to do something.
>>
>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>
>>
>> On Mon, Apr 29, 2013 at 10:32 AM, Dan Filimon <
>> [email protected]> wrote:
>>
>>> That's the weirdness, it's _exactly_ the same code.
>>>
>>>
>>> On Mon, Apr 29, 2013 at 6:31 PM, Robin Anil <[email protected]>
>>> wrote:
>>>
>>> > I will pull your Patch in and take a look a close look at it tonight.
>>> Whats
>>> > different in Run3 in Version7 v/s Run2 and Run1. Because Run3 looks
>>> really
>>> > good and the regressions of 30% in some are actually not as bad and
>>> nothing
>>> > no longer looks ridiculously blocking the patch.
>>> >
>>> >
>>> > On Mon, Apr 29, 2013 at 10:13 AM, Dan Filimon
>>> > <[email protected]>wrote:
>>> >
>>> > > I uploaded the latest patch to my vector branch on github [0].
>>> > >
>>> > > The latest revision is, as always, here [1], "Version 7
>>> > (Assign/Aggregate;
>>> > > 1 sec runtime)". Tests pass for this one except for some aggregate
>>> tests
>>> > > (it's not clear if aggregate should throw an exception when the
>>> vectors
>>> > are
>>> > > empty or return 0; I chose return 0).
>>> > >
>>> > > There's quite a bit of variance, a lot of it even in code that I
>>> haven't
>>> > > changed between runs. So, norm2 shows regressions in version 7 and
>>> > > improvements in version 6 and the code not having been changed.
>>> > >
>>> > > The dot product is also oscillating in version 7, showing different
>>> > results
>>> > > in the different runs (of the same algorithm).
>>> > >
>>> > > Except for some cleanup, I don't think I can do much better than
>>> this, so
>>> > > could someone please also run the benchmarks and I'll update the
>>> patch on
>>> > > ReviewBoard?
>>> > >
>>> > > [0] https://github.com/dfilimon/mahout/tree/vector
>>> > > [1]
>>> > >
>>> > >
>>> >
>>> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=7
>>> > >
>>> > >
>>> > > On Sun, Apr 28, 2013 at 11:18 PM, Ted Dunning <[email protected]
>>> >
>>> > > wrote:
>>> > >
>>> > > > It would *really* help to put a legend on these spreadsheets even
>>> if
>>> > you
>>> > > > have described it in the past.
>>> > > >
>>> > > > I take it that these are number of ops and bigger is better, right?
>>> > > >
>>> > > > And where is a description of what patch runs 1 and 2 actually are?
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Sun, Apr 28, 2013 at 2:23 AM, Dan Filimon <
>>> > > [email protected]>wrote:
>>> > > >
>>> > > >> Here's [1] the most recent benchmarks, after refactoring the
>>> Assign
>>> > into
>>> > > >> separate classes.
>>> > > >> It's now using a 10 second runtime and a 1 second lead time and
>>> these
>>> > > are
>>> > > >> only the benchmarks involving assign().
>>> > > >>
>>> > > >> Oddly enough there are regressions with RandomAccessSparseVectors'
>>> > > plus()
>>> > > >> and minus() despite these working like trunk from what I can tell
>>> from
>>> > > the
>>> > > >> tests (VectorBinaryAssignTest).
>>> > > >>
>>> > > >> Also, there's quite a bit of variance for some reason... Clone,
>>> Create
>>> > > >>  are just two of the odd ones.
>>> > > >>
>>> > > >> [1]
>>> > > >>
>>> > >
>>> >
>>> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4
>>> > > >>
>>> > > >>
>>> > > >> On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil <[email protected]
>>> >
>>> > > wrote:
>>> > > >>
>>> > > >>> Do in place for rand an dense. Assign a bug to me to speed up
>>> rasv
>>> > > >>> GetElement.
>>> > > >>> On Apr 25, 2013 2:56 PM, "Dan Filimon" <
>>> [email protected]>
>>> > > >>> wrote:
>>> > > >>>
>>> > > >>> > Nearly done splitting the code up, but I'm not sure what the
>>> costs
>>> > > >>> should
>>> > > >>> > ideally be.
>>> > > >>> >
>>> > > >>> > Robin, you proposed: cost of iteration + cost of lookup + cost
>>> of
>>> > > >>> update
>>> > > >>> > (if its not in-place)
>>> > > >>> >
>>> > > >>> > This sounds like it's per element, rather than for the entire
>>> > vector.
>>> > > >>> Also,
>>> > > >>> > are we just going to assume that there will be as many updates
>>> as
>>> > > >>> non-zeros
>>> > > >>> > or as large as the array is? How does the cost of an update
>>> factor
>>> > > in?
>>> > > >>> >
>>> > > >>> > And, is "in-place" just for DenseVectors?
>>> > > >>> >
>>> > > >>> >
>>> > > >>> > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil <
>>> [email protected]>
>>> > > >>> wrote:
>>> > > >>> >
>>> > > >>> > > Depends on the speed of copying a 1M double array v/s doing a
>>> > > calloc
>>> > > >>> +
>>> > > >>> > > copying 1000 non zeros (Assuming java is doing that
>>> underneath).
>>> > > >>> > >
>>> > > >>> > > ------
>>> > > >>> > > Robin Anil
>>> > > >>> > >
>>> > > >>> > >
>>> > > >>> > >
>>> > > >>> > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <
>>> > > >>> > [email protected]
>>> > > >>> > > >wrote:
>>> > > >>> > >
>>> > > >>> > > > Right, but is clone() generally slower than assigning? That
>>> > > >>> strikes me
>>> > > >>> > as
>>> > > >>> > > > odd; doesn't Java optimize copying the internal structures
>>> > (there
>>> > > >>> are
>>> > > >>> > > > arrays underneath after all)?
>>> > > >>> > > >
>>> > > >>> > > >
>>> > > >>> > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <
>>> > > [email protected]
>>> > > >>> >
>>> > > >>> > > wrote:
>>> > > >>> > > >
>>> > > >>> > > >> Seems like for dense clone is slower than like().assign I
>>> need
>>> > > to
>>> > > >>> test
>>> > > >>> > > it
>>> > > >>> > > >> with different sizes to be sure. I kept it from the old
>>> > > behavior.
>>> > > >>> > > >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <
>>> > > >>> [email protected]>
>>> > > >>> > > >> wrote:
>>> > > >>> > > >>
>>> > > >>> > > >>> Okay, so I should split it further into smaller sub-cases
>>> > that
>>> > > >>> handle
>>> > > >>> > > >>> each Vector type. I tried making so that these match the
>>> > cases
>>> > > >>> in the
>>> > > >>> > > >>> document to the extent possible.
>>> > > >>> > > >>> You're right. It is ugly and I need to split it up.
>>> > > >>> > > >>>
>>> > > >>> > > >>> I removed the OrderedIntDoubleMapping (but with another
>>> > if...)
>>> > > >>> but
>>> > > >>> > one
>>> > > >>> > > >>> more thing that helped is making the assign() only go
>>> through
>>> > > the
>>> > > >>> > > non-zeros
>>> > > >>> > > >>> when setting up a new copy.
>>> > > >>> > > >>>
>>> > > >>> > > >>> About the copying, I saw that the optimized copy doesn't
>>> > always
>>> > > >>> call
>>> > > >>> > > >>> clone clone on the Vector. Why is this?
>>> > > >>> > > >>>
>>> > > >>> > > >>> Thanks for the test and the feedback! :)
>>> > > >>> > > >>>
>>> > > >>> > > >>>
>>> > > >>> > > >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <
>>> > > >>> [email protected]
>>> > > >>> > > >wrote:
>>> > > >>> > > >>>
>>> > > >>> > > >>>> Yes its reaching dev. I checked out your code to take a
>>> > look.
>>> > > I
>>> > > >>> > > >>>> understand you are trying to still mix
>>> isSequentialAccess
>>> > etc
>>> > > >>> in the
>>> > > >>> > > >>>> validity calculation. The whole point of this
>>> architecture
>>> > is
>>> > > to
>>> > > >>> > > completely
>>> > > >>> > > >>>> unroll things to be more modular instead of creating a
>>> giant
>>> > > >>> > spaghetti
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> Here is an example of what I am trying to say.
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> This is your current implementation of
>>> > > >>> AssignIterateOneLookupOther.
>>> > > >>> > > >>>> There are so many conditionals
>>> > > >>> > > >>>>
>>> > > >>> > > >>>>
>>> > > >>> > > >>>>    1. @Override
>>> > > >>> > > >>>>    2.     public Vector assign(Vector x, Vector y,
>>> > > >>> > > >>>>    DoubleDoubleFunction f) {
>>> > > >>> > > >>>>    3.       return !swap ? assignInner(x, y, f) :
>>> > > >>> assignInner(y, x,
>>> > > >>> > > f);
>>> > > >>> > > >>>>    4.     }
>>> > > >>> > > >>>>    5.
>>> > > >>> > > >>>>    6.     public Vector assignInner(Vector x, Vector y,
>>> > > >>> > > >>>>    DoubleDoubleFunction f) {
>>> > > >>> > > >>>>    7.       Iterator<Vector.Element> xi =
>>> > x.iterateNonZero();
>>> > > >>> > > >>>>    8.       Vector.Element xe;
>>> > > >>> > > >>>>    9.       Vector.Element ye;
>>> > > >>> > > >>>>    10.       OrderedIntDoubleMapping updates =
>>> > > >>> > > newOrderedIntDoubleMapping();
>>> > > >>> > > >>>>    11.       while (xi.hasNext()) {
>>> > > >>> > > >>>>    12.         xe = xi.next();
>>> > > >>> > > >>>>    13.         ye = y.getElement(xe.index());
>>> > > >>> > > >>>>    14.         if (!swap) {
>>> > > >>> > > >>>>    15.           xe.set(f.apply(xe.get(), ye.get()));
>>> > > >>> > > >>>>    16.         } else {
>>> > > >>> > > >>>>    17.           if (ye.get() != 0.0 ||
>>> > > y.isAddConstantTime()) {
>>> > > >>> > > >>>>    18.             ye.set(f.apply(ye.get(), xe.get()));
>>> > > >>> > > >>>>    19.           } else {
>>> > > >>> > > >>>>    20.             updates.set(xe.index(),
>>> f.apply(ye.get(),
>>> > > >>> > > >>>>    xe.get()));
>>> > > >>> > > >>>>    21.           }
>>> > > >>> > > >>>>    22.         }
>>> > > >>> > > >>>>    23.       }
>>> > > >>> > > >>>>    24.       if (swap && !y.isAddConstantTime()) {
>>> > > >>> > > >>>>    25.         y.mergeUpdates(updates);
>>> > > >>> > > >>>>    26.       }
>>> > > >>> > > >>>>    27.       return swap ? y : x;
>>> > > >>> > > >>>>    28.     }
>>> > > >>> > > >>>>    29.   }
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> Split this into.
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> IterateThisLookupThatInplaceUpdate
>>> > > >>> > > >>>> IterateThatLookupThisInplaceUpdate
>>> > > >>> > > >>>> IterateThisLookupThatMergeUpdate
>>> > > >>> > > >>>> IterateThatLookupThisMergeUpdate
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> Removing the conditionals itself will speedup the
>>> operations
>>> > > >>> you see
>>> > > >>> > > >>>> regression on.
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> Also note that getElement() is not constant Add time for
>>> > RASV.
>>> > > >>> In
>>> > > >>> > > >>>> earlier version we tried at best to set in place. That
>>> > reduced
>>> > > >>> the
>>> > > >>> > > extra
>>> > > >>> > > >>>> hashkey computation. getElement in AbstractVector is
>>> really
>>> > > >>> > optimized
>>> > > >>> > > for
>>> > > >>> > > >>>> DenseVector and it does indicate as the cause of
>>> regression
>>> > > you
>>> > > >>> are
>>> > > >>> > > seeing
>>> > > >>> > > >>>> in RASV. Another one is the unused new
>>> > > >>> OrderedIntDoubleMapping().
>>> > > >>> > > Removing
>>> > > >>> > > >>>> all these will bring back the loss.
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> Now the cost for each can be computed as: cost of
>>> iteration
>>> > +
>>> > > >>> cost
>>> > > >>> > of
>>> > > >>> > > >>>> lookup + cost of update (if its not in-place)
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> See the attached file for a mock test. You will have to
>>> > rework
>>> > > >>> the
>>> > > >>> > > >>>> expectations but the framework should be understandable.
>>> > > >>> > > >>>>
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> ------
>>> > > >>> > > >>>> Robin Anil
>>> > > >>> > > >>>>
>>> > > >>> > > >>>>
>>> > > >>> > > >>>>
>>> > > >>> > > >>>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon <
>>> > > >>> > > >>>> [email protected]> wrote:
>>> > > >>> > > >>>>
>>> > > >>> > > >>>>> Here's [1] a link to the "design document" of the new
>>> > vector
>>> > > >>> > > >>>>> operations so I can lay out the ideas behind what I'm
>>> doing
>>> > > >>> more
>>> > > >>> > > clearly.
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>> I'd like anyone who can to have a look and comment.
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>> Ted,
>>> > > >>> > > >>>>> I know you'd like me to get back to working on
>>> clustering,
>>> > > but
>>> > > >>> > > >>>>> currently, BallKMeans is roughly 2x as slow as Mahout's
>>> > > KMeans.
>>> > > >>> > > >>>>> This is because of centroid.update() for one (it
>>> doesn't
>>> > have
>>> > > >>> > > >>>>> "interesting" properties that are exploited in the
>>> current)
>>> > > and
>>> > > >>> > also
>>> > > >>> > > >>>>> because we're using we're just doing lots of Vector
>>> > > >>> operations: dot
>>> > > >>> > > >>>>> products for projections, hash codes for unique sets of
>>> > > >>> candidates,
>>> > > >>> > > >>>>> equality checks etc.
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>> Here's a snapshot of the hotspots for one run of
>>> BallKMeans
>>> > > (20
>>> > > >>> > > >>>>> newsgroups, unprojected, i.e 20K sparse vectors with
>>> 200K
>>> > > >>> > dimensions
>>> > > >>> > > and
>>> > > >>> > > >>>>> ~100 nonzero values):
>>> > > >>> > > >>>>>   Name Time (ms)
>>> > > >>>  org.apache.mahout.math.AbstractVector.dot(Vector)
>>> > > >>> > > >>>>> 104069
>>>  org.apache.mahout.math.DelegatingVector.hashCode()
>>> > > >>> 30159
>>> > > >>> > >
>>> >  org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator,
>>> > > >>> > > >>>>> Iterator, DoubleDoubleFunction) 13720
>>> > > >>> > > >>>>>
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping)
>>> > > >>> > > >>>>> 2911  java.lang.ClassLoader.loadClass(String, boolean)
>>> 2620
>>> > > >>> > >  sun.misc.Launcher$AppClassLoader.loadClass(String,
>>> > > >>> > > >>>>> boolean) 2620
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>> You can see that the most time is spent doing Vector
>>> > > >>> operations.
>>> > > >>> > > >>>>> Combined, the first three rows account for 91% of the
>>> > > runtime.
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>> Thanks for your help!
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>> PS: Is this reaching dev@?
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>> [1]
>>> > > >>> > > >>>>>
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <
>>> > > >>> [email protected]
>>> > > >>> > > >wrote:
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>>> Few more brain dump ideas
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>> For any operation (assign, aggregate) you are trying
>>> to
>>> > do.
>>> > > >>> > > >>>>>> Compute set of "feasible" solutions: Based on all the
>>> > > >>> properties
>>> > > >>> > of
>>> > > >>> > > >>>>>> functions
>>> > > >>> > > >>>>>> Choose the solution that minimises the cost.
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>> For this you may need to refactor all the special
>>> kinds of
>>> > > >>> > > iterations
>>> > > >>> > > >>>>>> and aggregations subclassing from AssignOperation and
>>> > > >>> > > AggregateOperation.
>>> > > >>> > > >>>>>> Each operation would return the cost given two input
>>> > vectors
>>> > > >>> > (based
>>> > > >>> > > on
>>> > > >>> > > >>>>>> their cost properties)
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>> ------
>>> > > >>> > > >>>>>> Robin Anil
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <
>>> > > >>> [email protected]
>>> > > >>> > > >wrote:
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>>> Strike that. I think the code is very confusing. You
>>> will
>>> > > >>> have
>>> > > >>> > > >>>>>>> better luck if you create method for the three
>>> > > >>> implementations.
>>> > > >>> > It
>>> > > >>> > > will
>>> > > >>> > > >>>>>>> make things explicit and easy to debug.
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>> public final double getIterationCost() {
>>> > > >>> > > >>>>>>>  return one of {cardinality, 1.2*numNonDefault,
>>> > > >>> numNonDefault}
>>> > > >>> > > >>>>>>> }
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>> public final double getLookupCost() {
>>> > > >>> > > >>>>>>>  return one of {1, 1.5(lets say),
>>> log_2(numNonDefault)}
>>> > > >>> > > >>>>>>> }
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>> All your logic then becomes a minimization of the
>>> total
>>> > > cost.
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>> This is more extendable than current isDense(),
>>> > > >>> isSequential(),
>>> > > >>> > > >>>>>>> isSparse(), isRandom() and doesn't assume what the
>>> > > underlying
>>> > > >>> > > >>>>>>> implementation is.
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>> ------
>>> > > >>> > > >>>>>>> Robin Anil
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <
>>> > > >>> > [email protected]
>>> > > >>> > > >wrote:
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>>>  Also the calculation for cost is wrong. RASV is
>>> HashMap
>>> > > >>> based
>>> > > >>> > its
>>> > > >>> > > >>>>>>>> cost O(n) not O(log n)
>>> > > >>> > > >>>>>>>>
>>> > > >>> > > >>>>>>>> ------
>>> > > >>> > > >>>>>>>> Robin Anil
>>> > > >>> > > >>>>>>>>
>>> > > >>> > > >>>>>>>>
>>> > > >>> > > >>>>>>>>
>>> > > >>> > > >>>>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <
>>> > > >>> > [email protected]
>>> > > >>> > > >wrote:
>>> > > >>> > > >>>>>>>>
>>> > > >>> > > >>>>>>>>>    This is an automatically generated e-mail. To
>>> reply,
>>> > > >>> visit:
>>> > > >>> > > >>>>>>>>> https://reviews.apache.org/r/10669/
>>> > > >>> > > >>>>>>>>>
>>> > > >>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > > >>> > >
>>> > > >>>
>>> > >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56>
>>> > > >>> > > (Diff
>>> > > >>> > > >>>>>>>>> revision 2)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> None
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>    55
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>     boolean commutativeAndAssociative =
>>> > > >>> > > aggregator.isCommutative() && aggregator.isAssociative();
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   This should be a method in AbstractFunction
>>> interface
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037
>>> > > >>> >
>>> > > >>> > > (Diff
>>> > > >>> > > >>>>>>>>> revision 2)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
>>> > 'line':
>>> > > >>> 726,
>>> > > >>> > > 'expand_offset': 2}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
>>> > 'line':
>>> > > >>> 927,
>>> > > >>> > > 'expand_offset': 2}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>    942
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>       // Make all the non-zero values 0.
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   vector.clear()
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047
>>> > > >>> >
>>> > > >>> > > (Diff
>>> > > >>> > > >>>>>>>>> revision 2)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
>>> > 'line':
>>> > > >>> 726,
>>> > > >>> > > 'expand_offset': 2}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
>>> > 'line':
>>> > > >>> 927,
>>> > > >>> > > 'expand_offset': 2}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>    952
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>       while (it.hasNext()) {
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   There is no guarantee of ordered iteration. The
>>> > updates
>>> > > >>> is
>>> > > >>> > > unsorted. merge typically implies that input is sorted
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074
>>> > > >>> >
>>> > > >>> > > (Diff
>>> > > >>> > > >>>>>>>>> revision 2)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
>>> > 'line':
>>> > > >>> 726,
>>> > > >>> > > 'expand_offset': 2}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
>>> > 'line':
>>> > > >>> 927,
>>> > > >>> > > 'expand_offset': 2}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>    979
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>         element.set(values[index]);
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   What if there are values that are not in the
>>> existing
>>> > > >>> data. I
>>> > > >>> > > dont this is correct
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100
>>> > > >>> >
>>> > > >>> > > (Diff
>>> > > >>> > > >>>>>>>>> revision 2)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
>>> > 'line':
>>> > > >>> 726,
>>> > > >>> > > 'expand_offset': 2}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {',
>>> > 'line':
>>> > > >>> 927,
>>> > > >>> > > 'expand_offset': 2}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>    999
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>     invalidateCachedLength();
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   move invalidateCachedLength to the set method.
>>> > Instead
>>> > > of
>>> > > >>> > > dispersing it everywhere.
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115
>>> > > >>> >
>>> > > >>> > > (Diff
>>> > > >>> > > >>>>>>>>> revision 2)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> None
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': '  private Vector
>>> > > >>> assignIterateBoth(Iterator<Element>
>>> > > >>> > > thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>     1014
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   protected Vector
>>> > > >>> assignSkipZerosIterateOne(Iterator<Element>
>>> > > >>> > > thisIterator, Vector other,
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   be consistent, this that or thisIterator that
>>> > Iterator.
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171
>>> > > >>> >
>>> > > >>> > > (Diff
>>> > > >>> > > >>>>>>>>> revision 2)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> None
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': '  private Vector
>>> > > >>> assignIterateBoth(Iterator<Element>
>>> > > >>> > > thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>     1070
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   private Vector
>>> assignIterateBoth(Iterator<Element>
>>> > > >>> > > thisIterator, Iterator<Element> thatIterator,
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   This expects the data to to be sorted. indicate
>>> that
>>> > > in a
>>> > > >>> > > comment. Similarly comment other places
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > >
>>> > math/src/main/java/org/apache/mahout/math/function/Functions.java<
>>> > > >>> > >
>>> > > >>>
>>> > >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900
>>> > > >>> > >
>>> > > >>> > > (Diff
>>> > > >>> > > >>>>>>>>> revision 2)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> None
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> {'text': '      public boolean isLikeLeftMult() {',
>>> > > 'line':
>>> > > >>> > 892}
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>    896
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>       /**
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   Move this comments to the base class
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> - Robin
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote:
>>> > > >>> > > >>>>>>>>>   Review request for mahout, Ted Dunning, Sebastian
>>> > > >>> Schelter,
>>> > > >>> > and
>>> > > >>> > > >>>>>>>>> Robin Anil.
>>> > > >>> > > >>>>>>>>> By Dan Filimon.
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> *Updated April 22, 2013, 8:45 p.m.*
>>> > > >>> > > >>>>>>>>> Description
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> This patch contains code cleaning up
>>> AbstractVector and
>>> > > >>> making
>>> > > >>> > > the operations as fast as possible while still having a high
>>> > level
>>> > > >>> > > interface.
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> The main changes are in AbstractVector as well as
>>> new
>>> > > >>> methods
>>> > > >>> > in
>>> > > >>> > > DoubleDoubleFunction.
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   Testing
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> The vectors test pass but it's likely that the
>>> patch in
>>> > > >>> it's
>>> > > >>> > > current state is broken as other, unrelated tests
>>> (BallKMeans...)
>>> > > are
>>> > > >>> > > failing.
>>> > > >>> > > >>>>>>>>> Also, my Hadoop conf is broken so I didn't run all
>>> the
>>> > > core
>>> > > >>> > > tests. Anyone?
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> I can't seem to find the bug, so _please_ have a
>>> closer
>>> > > >>> look.
>>> > > >>> > > It's still a work in progress.
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> The benchmarks seem comparable (although there are
>>> some
>>> > > >>> jarring
>>> > > >>> > > diferences – Minkowski distance seems a lot slower in
>>> new-dan-1
>>> > > than
>>> > > >>> > > old-trunk-2). It may be however that this is just variance
>>> due to
>>> > > the
>>> > > >>> > load
>>> > > >>> > > of the machine at the time. I'm having trouble interpreting
>>> the
>>> > > >>> > benchmarks
>>> > > >>> > > in general, so anyone who could give me a hand is more than
>>> > > welcome.
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>   Diffs
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > > math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
>>> > > >>> > > >>>>>>>>>    (e12aa38)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > > math/src/main/java/org/apache/mahout/math/AbstractVector.java
>>> > > >>> > > >>>>>>>>>    (090aa7a)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> math/src/main/java/org/apache/mahout/math/Centroid.java
>>> > > >>> > > >>>>>>>>>    (0c42196)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > > math/src/main/java/org/apache/mahout/math/ConstantVector.java
>>> > > >>> > > >>>>>>>>>    (51d67d4)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> math/src/main/java/org/apache/mahout/math/DelegatingVector.java
>>> > > >>> > > >>>>>>>>>    (12220d4)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> math/src/main/java/org/apache/mahout/math/DenseVector.java
>>> > > >>> > > >>>>>>>>>    (41c356b)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
>>> > > >>> > > >>>>>>>>>    (094003b)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> math/src/main/java/org/apache/mahout/math/MatrixSlice.java
>>> > > >>> > > >>>>>>>>>    (7f79c96)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
>>> > > >>> > > >>>>>>>>>    (af70727)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> math/src/main/java/org/apache/mahout/math/NamedVector.java
>>> > > >>> > > >>>>>>>>>    (4b7a41d)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> > > >>>
>>> > math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
>>> > > >>> > > >>>>>>>>>    (650d82d)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
>>> > > >>> > > >>>>>>>>>    (d1ea93a)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> > > >>>
>>> > math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
>>> > > >>> > > >>>>>>>>>    (6f85692)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
>>> > > >>> > > >>>>>>>>>    (21982f9)
>>> > > >>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/Vector.java
>>> > > >>> > > >>>>>>>>>    (2f8b417)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> math/src/main/java/org/apache/mahout/math/VectorView.java
>>> > > >>> > > >>>>>>>>>    (add2a60)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > > math/src/main/java/org/apache/mahout/math/WeightedVector.java
>>> > > >>> > > >>>>>>>>>    (06fbd05)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > >
>>> >
>>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
>>> > > >>> > > >>>>>>>>>    (82b350a)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> math/src/main/java/org/apache/mahout/math/function/Functions.java
>>> > > >>> > > >>>>>>>>>    (eb2e42f)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> math/src/main/java/org/apache/mahout/math/function/PlusMult.java
>>> > > >>> > > >>>>>>>>>    (60587b1)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> > > math/src/main/java/org/apache/mahout/math/function/TimesFunction.java
>>> > > >>> > > >>>>>>>>>    (8ab0bb1)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> math/src/main/java/org/apache/mahout/math/jet/math/Constants.java
>>> > > >>> > > >>>>>>>>>    (53535d6)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> > >
>>> math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java
>>> > > >>> > > >>>>>>>>>    (2b11199)
>>> > > >>> > > >>>>>>>>>    -
>>> > > >>> math/src/test/java/org/apache/mahout/math/VectorTest.java
>>> > > >>> > > >>>>>>>>>    (d6d554b)
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>> View Diff <
>>> https://reviews.apache.org/r/10669/diff/>
>>> > > >>> > > >>>>>>>>>
>>> > > >>> > > >>>>>>>>
>>> > > >>> > > >>>>>>>>
>>> > > >>> > > >>>>>>>
>>> > > >>> > > >>>>>>
>>> > > >>> > > >>>>>
>>> > > >>> > > >>>>
>>> > > >>> > > >>>
>>> > > >>> > > >
>>> > > >>> > >
>>> > > >>> >
>>> > > >>>
>>> > > >>
>>> > > >>
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

Reply via email to