Okay, so I should split it further into smaller sub-cases that handle each
Vector type. I tried making so that these match the cases in the document
to the extent possible.
You're right. It is ugly and I need to split it up.

I removed the OrderedIntDoubleMapping (but with another if...) but one more
thing that helped is making the assign() only go through the non-zeros when
setting up a new copy.

About the copying, I saw that the optimized copy doesn't always call clone
clone on the Vector. Why is this?

Thanks for the test and the feedback! :)


On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <[email protected]> wrote:

> Yes its reaching dev. I checked out your code to take a look. I understand
> you are trying to still mix isSequentialAccess etc in the validity
> calculation. The whole point of this architecture is to completely unroll
> things to be more modular instead of creating a giant spaghetti
>
> Here is an example of what I am trying to say.
>
> This is your current implementation of AssignIterateOneLookupOther. There
> are so many conditionals
>
>
>    1. @Override
>    2.     public Vector assign(Vector x, Vector y, DoubleDoubleFunction
>    f) {
>    3.       return !swap ? assignInner(x, y, f) : assignInner(y, x, f);
>    4.     }
>    5.
>    6.     public Vector assignInner(Vector x, Vector y,
>    DoubleDoubleFunction f) {
>    7.       Iterator<Vector.Element> xi = x.iterateNonZero();
>    8.       Vector.Element xe;
>    9.       Vector.Element ye;
>    10.       OrderedIntDoubleMapping updates = newOrderedIntDoubleMapping();
>    11.       while (xi.hasNext()) {
>    12.         xe = xi.next();
>    13.         ye = y.getElement(xe.index());
>    14.         if (!swap) {
>    15.           xe.set(f.apply(xe.get(), ye.get()));
>    16.         } else {
>    17.           if (ye.get() != 0.0 || y.isAddConstantTime()) {
>    18.             ye.set(f.apply(ye.get(), xe.get()));
>    19.           } else {
>    20.             updates.set(xe.index(), f.apply(ye.get(), xe.get()));
>    21.           }
>    22.         }
>    23.       }
>    24.       if (swap && !y.isAddConstantTime()) {
>    25.         y.mergeUpdates(updates);
>    26.       }
>    27.       return swap ? y : x;
>    28.     }
>    29.   }
>
> Split this into.
>
> IterateThisLookupThatInplaceUpdate
> IterateThatLookupThisInplaceUpdate
> IterateThisLookupThatMergeUpdate
> IterateThatLookupThisMergeUpdate
>
> Removing the conditionals itself will speedup the operations you see
> regression on.
>
> Also note that getElement() is not constant Add time for RASV. In earlier
> version we tried at best to set in place. That reduced the extra hashkey
> computation. getElement in AbstractVector is really optimized for
> DenseVector and it does indicate as the cause of regression you are seeing
> in RASV. Another one is the unused new OrderedIntDoubleMapping(). Removing
> all these will bring back the loss.
>
> Now the cost for each can be computed as: cost of iteration + cost of
> lookup + cost of update (if its not in-place)
>
> See the attached file for a mock test. You will have to rework the
> expectations but the framework should be understandable.
>
>
> ------
> Robin Anil
>
>
>
> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon 
> <[email protected]>wrote:
>
>> Here's [1] a link to the "design document" of the new vector operations
>> so I can lay out the ideas behind what I'm doing more clearly.
>>
>> I'd like anyone who can to have a look and comment.
>>
>> Ted,
>> I know you'd like me to get back to working on clustering, but currently,
>> BallKMeans is roughly 2x as slow as Mahout's KMeans.
>> This is because of centroid.update() for one (it doesn't have
>> "interesting" properties that are exploited in the current) and also
>> because we're using we're just doing lots of Vector operations: dot
>> products for projections, hash codes for unique sets of candidates,
>> equality checks etc.
>>
>> Here's a snapshot of the hotspots for one run of BallKMeans (20
>> newsgroups, unprojected, i.e 20K sparse vectors with 200K dimensions and
>> ~100 nonzero values):
>>   Name Time (ms)  org.apache.mahout.math.AbstractVector.dot(Vector)
>> 104069  org.apache.mahout.math.DelegatingVector.hashCode() 30159  
>> org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator,
>> Iterator, DoubleDoubleFunction) 13720
>> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping)
>> 2911  java.lang.ClassLoader.loadClass(String, boolean) 2620  
>> sun.misc.Launcher$AppClassLoader.loadClass(String,
>> boolean) 2620
>>
>>
>> You can see that the most time is spent doing Vector operations.
>> Combined, the first three rows account for 91% of the runtime.
>>
>> Thanks for your help!
>>
>> PS: Is this reaching dev@?
>>
>> [1]
>> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#
>>
>>
>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <[email protected]> wrote:
>>
>>> Few more brain dump ideas
>>>
>>> For any operation (assign, aggregate) you are trying to do.
>>> Compute set of "feasible" solutions: Based on all the properties of
>>> functions
>>> Choose the solution that minimises the cost.
>>>
>>> For this you may need to refactor all the special kinds of iterations
>>> and aggregations subclassing from AssignOperation and AggregateOperation.
>>> Each operation would return the cost given two input vectors (based on
>>> their cost properties)
>>>
>>>
>>>
>>> ------
>>> Robin Anil
>>>
>>>
>>>
>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <[email protected]>wrote:
>>>
>>>> Strike that. I think the code is very confusing. You will have better
>>>> luck if you create method for the three implementations. It will make
>>>> things explicit and easy to debug.
>>>>
>>>> public final double getIterationCost() {
>>>>  return one of {cardinality, 1.2*numNonDefault, numNonDefault}
>>>> }
>>>>
>>>> public final double getLookupCost() {
>>>>  return one of {1, 1.5(lets say), log_2(numNonDefault)}
>>>> }
>>>>
>>>> All your logic then becomes a minimization of the total cost.
>>>>
>>>> This is more extendable than current isDense(), isSequential(),
>>>> isSparse(), isRandom() and doesn't assume what the underlying
>>>> implementation is.
>>>>
>>>>
>>>> ------
>>>> Robin Anil
>>>>
>>>>
>>>>
>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <[email protected]>wrote:
>>>>
>>>>>  Also the calculation for cost is wrong. RASV is HashMap based its
>>>>> cost O(n) not O(log n)
>>>>>
>>>>> ------
>>>>> Robin Anil
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <[email protected]>wrote:
>>>>>
>>>>>>    This is an automatically generated e-mail. To reply, visit:
>>>>>> https://reviews.apache.org/r/10669/
>>>>>>    
>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56>
>>>>>>  (Diff
>>>>>> revision 2)
>>>>>>
>>>>>> None
>>>>>>
>>>>>>    55
>>>>>>
>>>>>>     boolean commutativeAndAssociative = aggregator.isCommutative() && 
>>>>>> aggregator.isAssociative();
>>>>>>
>>>>>>   This should be a method in AbstractFunction interface
>>>>>>
>>>>>>
>>>>>>    
>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037>
>>>>>>  (Diff
>>>>>> revision 2)
>>>>>>
>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, 
>>>>>> 'expand_offset': 2}
>>>>>>
>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, 
>>>>>> 'expand_offset': 2}
>>>>>>
>>>>>>    942
>>>>>>
>>>>>>       // Make all the non-zero values 0.
>>>>>>
>>>>>>   vector.clear()
>>>>>>
>>>>>>
>>>>>>    
>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047>
>>>>>>  (Diff
>>>>>> revision 2)
>>>>>>
>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, 
>>>>>> 'expand_offset': 2}
>>>>>>
>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, 
>>>>>> 'expand_offset': 2}
>>>>>>
>>>>>>    952
>>>>>>
>>>>>>       while (it.hasNext()) {
>>>>>>
>>>>>>   There is no guarantee of ordered iteration. The updates is unsorted. 
>>>>>> merge typically implies that input is sorted
>>>>>>
>>>>>>
>>>>>>    
>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074>
>>>>>>  (Diff
>>>>>> revision 2)
>>>>>>
>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, 
>>>>>> 'expand_offset': 2}
>>>>>>
>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, 
>>>>>> 'expand_offset': 2}
>>>>>>
>>>>>>    979
>>>>>>
>>>>>>         element.set(values[index]);
>>>>>>
>>>>>>   What if there are values that are not in the existing data. I dont 
>>>>>> this is correct
>>>>>>
>>>>>>
>>>>>>    
>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100>
>>>>>>  (Diff
>>>>>> revision 2)
>>>>>>
>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, 
>>>>>> 'expand_offset': 2}
>>>>>>
>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, 
>>>>>> 'expand_offset': 2}
>>>>>>
>>>>>>    999
>>>>>>
>>>>>>     invalidateCachedLength();
>>>>>>
>>>>>>   move invalidateCachedLength to the set method. Instead of dispersing 
>>>>>> it everywhere.
>>>>>>
>>>>>>
>>>>>>    
>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115>
>>>>>>  (Diff
>>>>>> revision 2)
>>>>>>
>>>>>> None
>>>>>>
>>>>>> {'text': '  private Vector assignIterateBoth(Iterator<Element> 
>>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
>>>>>>
>>>>>>     1014
>>>>>>
>>>>>>   protected Vector assignSkipZerosIterateOne(Iterator<Element> 
>>>>>> thisIterator, Vector other,
>>>>>>
>>>>>>   be consistent, this that or thisIterator that Iterator.
>>>>>>
>>>>>>
>>>>>>    
>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171>
>>>>>>  (Diff
>>>>>> revision 2)
>>>>>>
>>>>>> None
>>>>>>
>>>>>> {'text': '  private Vector assignIterateBoth(Iterator<Element> 
>>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
>>>>>>
>>>>>>     1070
>>>>>>
>>>>>>   private Vector assignIterateBoth(Iterator<Element> thisIterator, 
>>>>>> Iterator<Element> thatIterator,
>>>>>>
>>>>>>   This expects the data to to be sorted. indicate that in a comment. 
>>>>>> Similarly comment other places
>>>>>>
>>>>>>
>>>>>>    
>>>>>> math/src/main/java/org/apache/mahout/math/function/Functions.java<https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900>
>>>>>>  (Diff
>>>>>> revision 2)
>>>>>>
>>>>>> None
>>>>>>
>>>>>> {'text': '      public boolean isLikeLeftMult() {', 'line': 892}
>>>>>>
>>>>>>    896
>>>>>>
>>>>>>       /**
>>>>>>
>>>>>>   Move this comments to the base class
>>>>>>
>>>>>>
>>>>>> - Robin
>>>>>>
>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote:
>>>>>>   Review request for mahout, Ted Dunning, Sebastian Schelter, and
>>>>>> Robin Anil.
>>>>>> By Dan Filimon.
>>>>>>
>>>>>> *Updated April 22, 2013, 8:45 p.m.*
>>>>>> Description
>>>>>>
>>>>>> This patch contains code cleaning up AbstractVector and making the 
>>>>>> operations as fast as possible while still having a high level interface.
>>>>>>
>>>>>> The main changes are in AbstractVector as well as new methods in 
>>>>>> DoubleDoubleFunction.
>>>>>>
>>>>>>   Testing
>>>>>>
>>>>>> The vectors test pass but it's likely that the patch in it's current 
>>>>>> state is broken as other, unrelated tests (BallKMeans...) are failing.
>>>>>> Also, my Hadoop conf is broken so I didn't run all the core tests. 
>>>>>> Anyone?
>>>>>>
>>>>>> I can't seem to find the bug, so _please_ have a closer look. It's still 
>>>>>> a work in progress.
>>>>>>
>>>>>> The benchmarks seem comparable (although there are some jarring 
>>>>>> diferences – Minkowski distance seems a lot slower in new-dan-1 than 
>>>>>> old-trunk-2). It may be however that this is just variance due to the 
>>>>>> load of the machine at the time. I'm having trouble interpreting the 
>>>>>> benchmarks in general, so anyone who could give me a hand is more than 
>>>>>> welcome.
>>>>>>
>>>>>>   Diffs
>>>>>>
>>>>>>    - math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
>>>>>>    (e12aa38)
>>>>>>    - math/src/main/java/org/apache/mahout/math/AbstractVector.java
>>>>>>    (090aa7a)
>>>>>>    - math/src/main/java/org/apache/mahout/math/Centroid.java
>>>>>>    (0c42196)
>>>>>>    - math/src/main/java/org/apache/mahout/math/ConstantVector.java
>>>>>>    (51d67d4)
>>>>>>    - math/src/main/java/org/apache/mahout/math/DelegatingVector.java
>>>>>>    (12220d4)
>>>>>>    - math/src/main/java/org/apache/mahout/math/DenseVector.java
>>>>>>    (41c356b)
>>>>>>    - 
>>>>>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
>>>>>>    (094003b)
>>>>>>    - math/src/main/java/org/apache/mahout/math/MatrixSlice.java
>>>>>>    (7f79c96)
>>>>>>    - math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
>>>>>>    (af70727)
>>>>>>    - math/src/main/java/org/apache/mahout/math/NamedVector.java
>>>>>>    (4b7a41d)
>>>>>>    - 
>>>>>> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
>>>>>>    (650d82d)
>>>>>>    - math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
>>>>>>    (d1ea93a)
>>>>>>    - 
>>>>>> math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
>>>>>>    (6f85692)
>>>>>>    - 
>>>>>> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
>>>>>>    (21982f9)
>>>>>>    - math/src/main/java/org/apache/mahout/math/Vector.java (2f8b417)
>>>>>>    - math/src/main/java/org/apache/mahout/math/VectorView.java
>>>>>>    (add2a60)
>>>>>>    - math/src/main/java/org/apache/mahout/math/WeightedVector.java
>>>>>>    (06fbd05)
>>>>>>    - 
>>>>>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
>>>>>>    (82b350a)
>>>>>>    - math/src/main/java/org/apache/mahout/math/function/Functions.java
>>>>>>    (eb2e42f)
>>>>>>    - math/src/main/java/org/apache/mahout/math/function/PlusMult.java
>>>>>>    (60587b1)
>>>>>>    - 
>>>>>> math/src/main/java/org/apache/mahout/math/function/TimesFunction.java
>>>>>>    (8ab0bb1)
>>>>>>    - math/src/main/java/org/apache/mahout/math/jet/math/Constants.java
>>>>>>    (53535d6)
>>>>>>    - math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java
>>>>>>    (2b11199)
>>>>>>    - math/src/test/java/org/apache/mahout/math/VectorTest.java
>>>>>>    (d6d554b)
>>>>>>
>>>>>> View Diff <https://reviews.apache.org/r/10669/diff/>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Reply via email to