Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

Dan Filimon Mon, 29 Apr 2013 08:14:56 -0700

I uploaded the latest patch to my vector branch on github [0].

The latest revision is, as always, here [1], "Version 7 (Assign/Aggregate;
1 sec runtime)". Tests pass for this one except for some aggregate tests
(it's not clear if aggregate should throw an exception when the vectors are
empty or return 0; I chose return 0).


There's quite a bit of variance, a lot of it even in code that I haven't
changed between runs. So, norm2 shows regressions in version 7 and
improvements in version 6 and the code not having been changed.

The dot product is also oscillating in version 7, showing different results
in the different runs (of the same algorithm).

Except for some cleanup, I don't think I can do much better than this, so
could someone please also run the benchmarks and I'll update the patch on
ReviewBoard?

[0] https://github.com/dfilimon/mahout/tree/vector
[1]
https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=7


On Sun, Apr 28, 2013 at 11:18 PM, Ted Dunning <[email protected]> wrote:

> It would *really* help to put a legend on these spreadsheets even if you
> have described it in the past.
>
> I take it that these are number of ops and bigger is better, right?
>
> And where is a description of what patch runs 1 and 2 actually are?
>
>
>
> On Sun, Apr 28, 2013 at 2:23 AM, Dan Filimon 
> <[email protected]>wrote:
>
>> Here's [1] the most recent benchmarks, after refactoring the Assign into
>> separate classes.
>> It's now using a 10 second runtime and a 1 second lead time and these are
>> only the benchmarks involving assign().
>>
>> Oddly enough there are regressions with RandomAccessSparseVectors' plus()
>> and minus() despite these working like trunk from what I can tell from the
>> tests (VectorBinaryAssignTest).
>>
>> Also, there's quite a bit of variance for some reason... Clone, Create
>>  are just two of the odd ones.
>>
>> [1]
>> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4
>>
>>
>> On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil <[email protected]> wrote:
>>
>>> Do in place for rand an dense. Assign a bug to me to speed up rasv
>>> GetElement.
>>> On Apr 25, 2013 2:56 PM, "Dan Filimon" <[email protected]>
>>> wrote:
>>>
>>> > Nearly done splitting the code up, but I'm not sure what the costs
>>> should
>>> > ideally be.
>>> >
>>> > Robin, you proposed: cost of iteration + cost of lookup + cost of
>>> update
>>> > (if its not in-place)
>>> >
>>> > This sounds like it's per element, rather than for the entire vector.
>>> Also,
>>> > are we just going to assume that there will be as many updates as
>>> non-zeros
>>> > or as large as the array is? How does the cost of an update factor in?
>>> >
>>> > And, is "in-place" just for DenseVectors?
>>> >
>>> >
>>> > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil <[email protected]>
>>> wrote:
>>> >
>>> > > Depends on the speed of copying a 1M double array v/s doing a calloc
>>> +
>>> > > copying 1000 non zeros (Assuming java is doing that underneath).
>>> > >
>>> > > ------
>>> > > Robin Anil
>>> > >
>>> > >
>>> > >
>>> > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <
>>> > [email protected]
>>> > > >wrote:
>>> > >
>>> > > > Right, but is clone() generally slower than assigning? That
>>> strikes me
>>> > as
>>> > > > odd; doesn't Java optimize copying the internal structures (there
>>> are
>>> > > > arrays underneath after all)?
>>> > > >
>>> > > >
>>> > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <[email protected]
>>> >
>>> > > wrote:
>>> > > >
>>> > > >> Seems like for dense clone is slower than like().assign I need to
>>> test
>>> > > it
>>> > > >> with different sizes to be sure. I kept it from the old behavior.
>>> > > >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <
>>> [email protected]>
>>> > > >> wrote:
>>> > > >>
>>> > > >>> Okay, so I should split it further into smaller sub-cases that
>>> handle
>>> > > >>> each Vector type. I tried making so that these match the cases
>>> in the
>>> > > >>> document to the extent possible.
>>> > > >>> You're right. It is ugly and I need to split it up.
>>> > > >>>
>>> > > >>> I removed the OrderedIntDoubleMapping (but with another if...)
>>> but
>>> > one
>>> > > >>> more thing that helped is making the assign() only go through the
>>> > > non-zeros
>>> > > >>> when setting up a new copy.
>>> > > >>>
>>> > > >>> About the copying, I saw that the optimized copy doesn't always
>>> call
>>> > > >>> clone clone on the Vector. Why is this?
>>> > > >>>
>>> > > >>> Thanks for the test and the feedback! :)
>>> > > >>>
>>> > > >>>
>>> > > >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <
>>> [email protected]
>>> > > >wrote:
>>> > > >>>
>>> > > >>>> Yes its reaching dev. I checked out your code to take a look. I
>>> > > >>>> understand you are trying to still mix isSequentialAccess etc
>>> in the
>>> > > >>>> validity calculation. The whole point of this architecture is to
>>> > > completely
>>> > > >>>> unroll things to be more modular instead of creating a giant
>>> > spaghetti
>>> > > >>>>
>>> > > >>>> Here is an example of what I am trying to say.
>>> > > >>>>
>>> > > >>>> This is your current implementation of
>>> AssignIterateOneLookupOther.
>>> > > >>>> There are so many conditionals
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>    1. @Override
>>> > > >>>>    2.     public Vector assign(Vector x, Vector y,
>>> > > >>>>    DoubleDoubleFunction f) {
>>> > > >>>>    3.       return !swap ? assignInner(x, y, f) :
>>> assignInner(y, x,
>>> > > f);
>>> > > >>>>    4.     }
>>> > > >>>>    5.
>>> > > >>>>    6.     public Vector assignInner(Vector x, Vector y,
>>> > > >>>>    DoubleDoubleFunction f) {
>>> > > >>>>    7.       Iterator<Vector.Element> xi = x.iterateNonZero();
>>> > > >>>>    8.       Vector.Element xe;
>>> > > >>>>    9.       Vector.Element ye;
>>> > > >>>>    10.       OrderedIntDoubleMapping updates =
>>> > > newOrderedIntDoubleMapping();
>>> > > >>>>    11.       while (xi.hasNext()) {
>>> > > >>>>    12.         xe = xi.next();
>>> > > >>>>    13.         ye = y.getElement(xe.index());
>>> > > >>>>    14.         if (!swap) {
>>> > > >>>>    15.           xe.set(f.apply(xe.get(), ye.get()));
>>> > > >>>>    16.         } else {
>>> > > >>>>    17.           if (ye.get() != 0.0 || y.isAddConstantTime()) {
>>> > > >>>>    18.             ye.set(f.apply(ye.get(), xe.get()));
>>> > > >>>>    19.           } else {
>>> > > >>>>    20.             updates.set(xe.index(), f.apply(ye.get(),
>>> > > >>>>    xe.get()));
>>> > > >>>>    21.           }
>>> > > >>>>    22.         }
>>> > > >>>>    23.       }
>>> > > >>>>    24.       if (swap && !y.isAddConstantTime()) {
>>> > > >>>>    25.         y.mergeUpdates(updates);
>>> > > >>>>    26.       }
>>> > > >>>>    27.       return swap ? y : x;
>>> > > >>>>    28.     }
>>> > > >>>>    29.   }
>>> > > >>>>
>>> > > >>>> Split this into.
>>> > > >>>>
>>> > > >>>> IterateThisLookupThatInplaceUpdate
>>> > > >>>> IterateThatLookupThisInplaceUpdate
>>> > > >>>> IterateThisLookupThatMergeUpdate
>>> > > >>>> IterateThatLookupThisMergeUpdate
>>> > > >>>>
>>> > > >>>> Removing the conditionals itself will speedup the operations
>>> you see
>>> > > >>>> regression on.
>>> > > >>>>
>>> > > >>>> Also note that getElement() is not constant Add time for RASV.
>>> In
>>> > > >>>> earlier version we tried at best to set in place. That reduced
>>> the
>>> > > extra
>>> > > >>>> hashkey computation. getElement in AbstractVector is really
>>> > optimized
>>> > > for
>>> > > >>>> DenseVector and it does indicate as the cause of regression you
>>> are
>>> > > seeing
>>> > > >>>> in RASV. Another one is the unused new
>>> OrderedIntDoubleMapping().
>>> > > Removing
>>> > > >>>> all these will bring back the loss.
>>> > > >>>>
>>> > > >>>> Now the cost for each can be computed as: cost of iteration +
>>> cost
>>> > of
>>> > > >>>> lookup + cost of update (if its not in-place)
>>> > > >>>>
>>> > > >>>> See the attached file for a mock test. You will have to rework
>>> the
>>> > > >>>> expectations but the framework should be understandable.
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> ------
>>> > > >>>> Robin Anil
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon <
>>> > > >>>> [email protected]> wrote:
>>> > > >>>>
>>> > > >>>>> Here's [1] a link to the "design document" of the new vector
>>> > > >>>>> operations so I can lay out the ideas behind what I'm doing
>>> more
>>> > > clearly.
>>> > > >>>>>
>>> > > >>>>> I'd like anyone who can to have a look and comment.
>>> > > >>>>>
>>> > > >>>>> Ted,
>>> > > >>>>> I know you'd like me to get back to working on clustering, but
>>> > > >>>>> currently, BallKMeans is roughly 2x as slow as Mahout's KMeans.
>>> > > >>>>> This is because of centroid.update() for one (it doesn't have
>>> > > >>>>> "interesting" properties that are exploited in the current) and
>>> > also
>>> > > >>>>> because we're using we're just doing lots of Vector
>>> operations: dot
>>> > > >>>>> products for projections, hash codes for unique sets of
>>> candidates,
>>> > > >>>>> equality checks etc.
>>> > > >>>>>
>>> > > >>>>> Here's a snapshot of the hotspots for one run of BallKMeans (20
>>> > > >>>>> newsgroups, unprojected, i.e 20K sparse vectors with 200K
>>> > dimensions
>>> > > and
>>> > > >>>>> ~100 nonzero values):
>>> > > >>>>>   Name Time (ms)
>>>  org.apache.mahout.math.AbstractVector.dot(Vector)
>>> > > >>>>> 104069  org.apache.mahout.math.DelegatingVector.hashCode()
>>> 30159
>>> > >  org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator,
>>> > > >>>>> Iterator, DoubleDoubleFunction) 13720
>>> > > >>>>>
>>> > >
>>> >
>>> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping)
>>> > > >>>>> 2911  java.lang.ClassLoader.loadClass(String, boolean) 2620
>>> > >  sun.misc.Launcher$AppClassLoader.loadClass(String,
>>> > > >>>>> boolean) 2620
>>> > > >>>>>
>>> > > >>>>>
>>> > > >>>>> You can see that the most time is spent doing Vector
>>> operations.
>>> > > >>>>> Combined, the first three rows account for 91% of the runtime.
>>> > > >>>>>
>>> > > >>>>> Thanks for your help!
>>> > > >>>>>
>>> > > >>>>> PS: Is this reaching dev@?
>>> > > >>>>>
>>> > > >>>>> [1]
>>> > > >>>>>
>>> > >
>>> >
>>> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#
>>> > > >>>>>
>>> > > >>>>>
>>> > > >>>>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <
>>> [email protected]
>>> > > >wrote:
>>> > > >>>>>
>>> > > >>>>>> Few more brain dump ideas
>>> > > >>>>>>
>>> > > >>>>>> For any operation (assign, aggregate) you are trying to do.
>>> > > >>>>>> Compute set of "feasible" solutions: Based on all the
>>> properties
>>> > of
>>> > > >>>>>> functions
>>> > > >>>>>> Choose the solution that minimises the cost.
>>> > > >>>>>>
>>> > > >>>>>> For this you may need to refactor all the special kinds of
>>> > > iterations
>>> > > >>>>>> and aggregations subclassing from AssignOperation and
>>> > > AggregateOperation.
>>> > > >>>>>> Each operation would return the cost given two input vectors
>>> > (based
>>> > > on
>>> > > >>>>>> their cost properties)
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>> ------
>>> > > >>>>>> Robin Anil
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>>
>>> > > >>>>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <
>>> [email protected]
>>> > > >wrote:
>>> > > >>>>>>
>>> > > >>>>>>> Strike that. I think the code is very confusing. You will
>>> have
>>> > > >>>>>>> better luck if you create method for the three
>>> implementations.
>>> > It
>>> > > will
>>> > > >>>>>>> make things explicit and easy to debug.
>>> > > >>>>>>>
>>> > > >>>>>>> public final double getIterationCost() {
>>> > > >>>>>>>  return one of {cardinality, 1.2*numNonDefault,
>>> numNonDefault}
>>> > > >>>>>>> }
>>> > > >>>>>>>
>>> > > >>>>>>> public final double getLookupCost() {
>>> > > >>>>>>>  return one of {1, 1.5(lets say), log_2(numNonDefault)}
>>> > > >>>>>>> }
>>> > > >>>>>>>
>>> > > >>>>>>> All your logic then becomes a minimization of the total cost.
>>> > > >>>>>>>
>>> > > >>>>>>> This is more extendable than current isDense(),
>>> isSequential(),
>>> > > >>>>>>> isSparse(), isRandom() and doesn't assume what the underlying
>>> > > >>>>>>> implementation is.
>>> > > >>>>>>>
>>> > > >>>>>>>
>>> > > >>>>>>> ------
>>> > > >>>>>>> Robin Anil
>>> > > >>>>>>>
>>> > > >>>>>>>
>>> > > >>>>>>>
>>> > > >>>>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <
>>> > [email protected]
>>> > > >wrote:
>>> > > >>>>>>>
>>> > > >>>>>>>>  Also the calculation for cost is wrong. RASV is HashMap
>>> based
>>> > its
>>> > > >>>>>>>> cost O(n) not O(log n)
>>> > > >>>>>>>>
>>> > > >>>>>>>> ------
>>> > > >>>>>>>> Robin Anil
>>> > > >>>>>>>>
>>> > > >>>>>>>>
>>> > > >>>>>>>>
>>> > > >>>>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <
>>> > [email protected]
>>> > > >wrote:
>>> > > >>>>>>>>
>>> > > >>>>>>>>>    This is an automatically generated e-mail. To reply,
>>> visit:
>>> > > >>>>>>>>> https://reviews.apache.org/r/10669/
>>> > > >>>>>>>>>
>>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56>
>>> > > (Diff
>>> > > >>>>>>>>> revision 2)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> None
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>    55
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>     boolean commutativeAndAssociative =
>>> > > aggregator.isCommutative() && aggregator.isAssociative();
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   This should be a method in AbstractFunction interface
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037
>>> >
>>> > > (Diff
>>> > > >>>>>>>>> revision 2)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line':
>>> 726,
>>> > > 'expand_offset': 2}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line':
>>> 927,
>>> > > 'expand_offset': 2}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>    942
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>       // Make all the non-zero values 0.
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   vector.clear()
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047
>>> >
>>> > > (Diff
>>> > > >>>>>>>>> revision 2)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line':
>>> 726,
>>> > > 'expand_offset': 2}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line':
>>> 927,
>>> > > 'expand_offset': 2}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>    952
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>       while (it.hasNext()) {
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   There is no guarantee of ordered iteration. The updates
>>> is
>>> > > unsorted. merge typically implies that input is sorted
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074
>>> >
>>> > > (Diff
>>> > > >>>>>>>>> revision 2)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line':
>>> 726,
>>> > > 'expand_offset': 2}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line':
>>> 927,
>>> > > 'expand_offset': 2}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>    979
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>         element.set(values[index]);
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   What if there are values that are not in the existing
>>> data. I
>>> > > dont this is correct
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100
>>> >
>>> > > (Diff
>>> > > >>>>>>>>> revision 2)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line':
>>> 726,
>>> > > 'expand_offset': 2}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line':
>>> 927,
>>> > > 'expand_offset': 2}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>    999
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>     invalidateCachedLength();
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   move invalidateCachedLength to the set method. Instead of
>>> > > dispersing it everywhere.
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115
>>> >
>>> > > (Diff
>>> > > >>>>>>>>> revision 2)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> None
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': '  private Vector
>>> assignIterateBoth(Iterator<Element>
>>> > > thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>     1014
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   protected Vector
>>> assignSkipZerosIterateOne(Iterator<Element>
>>> > > thisIterator, Vector other,
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   be consistent, this that or thisIterator that Iterator.
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> >  math/src/main/java/org/apache/mahout/math/AbstractVector.java<
>>> > >
>>> >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171
>>> >
>>> > > (Diff
>>> > > >>>>>>>>> revision 2)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> None
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': '  private Vector
>>> assignIterateBoth(Iterator<Element>
>>> > > thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>     1070
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   private Vector assignIterateBoth(Iterator<Element>
>>> > > thisIterator, Iterator<Element> thatIterator,
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   This expects the data to to be sorted. indicate that in a
>>> > > comment. Similarly comment other places
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > math/src/main/java/org/apache/mahout/math/function/Functions.java<
>>> > >
>>> https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900
>>> > >
>>> > > (Diff
>>> > > >>>>>>>>> revision 2)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> None
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> {'text': '      public boolean isLikeLeftMult() {', 'line':
>>> > 892}
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>    896
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>       /**
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   Move this comments to the base class
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> - Robin
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote:
>>> > > >>>>>>>>>   Review request for mahout, Ted Dunning, Sebastian
>>> Schelter,
>>> > and
>>> > > >>>>>>>>> Robin Anil.
>>> > > >>>>>>>>> By Dan Filimon.
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> *Updated April 22, 2013, 8:45 p.m.*
>>> > > >>>>>>>>> Description
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> This patch contains code cleaning up AbstractVector and
>>> making
>>> > > the operations as fast as possible while still having a high level
>>> > > interface.
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> The main changes are in AbstractVector as well as new
>>> methods
>>> > in
>>> > > DoubleDoubleFunction.
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   Testing
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> The vectors test pass but it's likely that the patch in
>>> it's
>>> > > current state is broken as other, unrelated tests (BallKMeans...) are
>>> > > failing.
>>> > > >>>>>>>>> Also, my Hadoop conf is broken so I didn't run all the core
>>> > > tests. Anyone?
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> I can't seem to find the bug, so _please_ have a closer
>>> look.
>>> > > It's still a work in progress.
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> The benchmarks seem comparable (although there are some
>>> jarring
>>> > > diferences – Minkowski distance seems a lot slower in new-dan-1 than
>>> > > old-trunk-2). It may be however that this is just variance due to the
>>> > load
>>> > > of the machine at the time. I'm having trouble interpreting the
>>> > benchmarks
>>> > > in general, so anyone who could give me a hand is more than welcome.
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>   Diffs
>>> > > >>>>>>>>>
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
>>> > > >>>>>>>>>    (e12aa38)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/AbstractVector.java
>>> > > >>>>>>>>>    (090aa7a)
>>> > > >>>>>>>>>    -
>>> math/src/main/java/org/apache/mahout/math/Centroid.java
>>> > > >>>>>>>>>    (0c42196)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/ConstantVector.java
>>> > > >>>>>>>>>    (51d67d4)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/DelegatingVector.java
>>> > > >>>>>>>>>    (12220d4)
>>> > > >>>>>>>>>    -
>>> math/src/main/java/org/apache/mahout/math/DenseVector.java
>>> > > >>>>>>>>>    (41c356b)
>>> > > >>>>>>>>>    -
>>> > >
>>> >
>>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
>>> > > >>>>>>>>>    (094003b)
>>> > > >>>>>>>>>    -
>>> math/src/main/java/org/apache/mahout/math/MatrixSlice.java
>>> > > >>>>>>>>>    (7f79c96)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
>>> > > >>>>>>>>>    (af70727)
>>> > > >>>>>>>>>    -
>>> math/src/main/java/org/apache/mahout/math/NamedVector.java
>>> > > >>>>>>>>>    (4b7a41d)
>>> > > >>>>>>>>>    -
>>> > >
>>> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
>>> > > >>>>>>>>>    (650d82d)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
>>> > > >>>>>>>>>    (d1ea93a)
>>> > > >>>>>>>>>    -
>>> > >
>>> math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
>>> > > >>>>>>>>>    (6f85692)
>>> > > >>>>>>>>>    -
>>> > >
>>> >
>>> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
>>> > > >>>>>>>>>    (21982f9)
>>> > > >>>>>>>>>    - math/src/main/java/org/apache/mahout/math/Vector.java
>>> > > >>>>>>>>>    (2f8b417)
>>> > > >>>>>>>>>    -
>>> math/src/main/java/org/apache/mahout/math/VectorView.java
>>> > > >>>>>>>>>    (add2a60)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/WeightedVector.java
>>> > > >>>>>>>>>    (06fbd05)
>>> > > >>>>>>>>>    -
>>> > >
>>> >
>>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
>>> > > >>>>>>>>>    (82b350a)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/function/Functions.java
>>> > > >>>>>>>>>    (eb2e42f)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/function/PlusMult.java
>>> > > >>>>>>>>>    (60587b1)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/function/TimesFunction.java
>>> > > >>>>>>>>>    (8ab0bb1)
>>> > > >>>>>>>>>    -
>>> > > math/src/main/java/org/apache/mahout/math/jet/math/Constants.java
>>> > > >>>>>>>>>    (53535d6)
>>> > > >>>>>>>>>    -
>>> > > math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java
>>> > > >>>>>>>>>    (2b11199)
>>> > > >>>>>>>>>    -
>>> math/src/test/java/org/apache/mahout/math/VectorTest.java
>>> > > >>>>>>>>>    (d6d554b)
>>> > > >>>>>>>>>
>>> > > >>>>>>>>> View Diff <https://reviews.apache.org/r/10669/diff/>
>>> > > >>>>>>>>>
>>> > > >>>>>>>>
>>> > > >>>>>>>>
>>> > > >>>>>>>
>>> > > >>>>>>
>>> > > >>>>>
>>> > > >>>>
>>> > > >>>
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

Reply via email to