Yes its reaching dev. I checked out your code to take a look. I understand
you are trying to still mix isSequentialAccess etc in the validity
calculation. The whole point of this architecture is to completely unroll
things to be more modular instead of creating a giant spaghetti
Here is an example of what I am trying to say.
This is your current implementation of AssignIterateOneLookupOther. There
are so many conditionals
1. @Override
2. public Vector assign(Vector x, Vector y, DoubleDoubleFunction f) {
3. return !swap ? assignInner(x, y, f) : assignInner(y, x, f);
4. }
5.
6. public Vector assignInner(Vector x, Vector y,
DoubleDoubleFunction f) {
7. Iterator<Vector.Element> xi = x.iterateNonZero();
8. Vector.Element xe;
9. Vector.Element ye;
10. OrderedIntDoubleMapping updates = newOrderedIntDoubleMapping();
11. while (xi.hasNext()) {
12. xe = xi.next();
13. ye = y.getElement(xe.index());
14. if (!swap) {
15. xe.set(f.apply(xe.get(), ye.get()));
16. } else {
17. if (ye.get() != 0.0 || y.isAddConstantTime()) {
18. ye.set(f.apply(ye.get(), xe.get()));
19. } else {
20. updates.set(xe.index(), f.apply(ye.get(), xe.get()));
21. }
22. }
23. }
24. if (swap && !y.isAddConstantTime()) {
25. y.mergeUpdates(updates);
26. }
27. return swap ? y : x;
28. }
29. }
Split this into.
IterateThisLookupThatInplaceUpdate
IterateThatLookupThisInplaceUpdate
IterateThisLookupThatMergeUpdate
IterateThatLookupThisMergeUpdate
Removing the conditionals itself will speedup the operations you see
regression on.
Also note that getElement() is not constant Add time for RASV. In earlier
version we tried at best to set in place. That reduced the extra hashkey
computation. getElement in AbstractVector is really optimized for
DenseVector and it does indicate as the cause of regression you are seeing
in RASV. Another one is the unused new OrderedIntDoubleMapping(). Removing
all these will bring back the loss.
Now the cost for each can be computed as: cost of iteration + cost of
lookup + cost of update (if its not in-place)
See the attached file for a mock test. You will have to rework the
expectations but the framework should be understandable.
------
Robin Anil
On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon <[email protected]>wrote:
> Here's [1] a link to the "design document" of the new vector operations so
> I can lay out the ideas behind what I'm doing more clearly.
>
> I'd like anyone who can to have a look and comment.
>
> Ted,
> I know you'd like me to get back to working on clustering, but currently,
> BallKMeans is roughly 2x as slow as Mahout's KMeans.
> This is because of centroid.update() for one (it doesn't have
> "interesting" properties that are exploited in the current) and also
> because we're using we're just doing lots of Vector operations: dot
> products for projections, hash codes for unique sets of candidates,
> equality checks etc.
>
> Here's a snapshot of the hotspots for one run of BallKMeans (20
> newsgroups, unprojected, i.e 20K sparse vectors with 200K dimensions and
> ~100 nonzero values):
> Name Time (ms) org.apache.mahout.math.AbstractVector.dot(Vector) 104069
> org.apache.mahout.math.DelegatingVector.hashCode() 30159
> org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator,
> Iterator, DoubleDoubleFunction) 13720
> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping)
> 2911 java.lang.ClassLoader.loadClass(String, boolean) 2620
> sun.misc.Launcher$AppClassLoader.loadClass(String,
> boolean) 2620
>
>
> You can see that the most time is spent doing Vector operations. Combined,
> the first three rows account for 91% of the runtime.
>
> Thanks for your help!
>
> PS: Is this reaching dev@?
>
> [1]
> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#
>
>
> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <[email protected]> wrote:
>
>> Few more brain dump ideas
>>
>> For any operation (assign, aggregate) you are trying to do.
>> Compute set of "feasible" solutions: Based on all the properties of
>> functions
>> Choose the solution that minimises the cost.
>>
>> For this you may need to refactor all the special kinds of iterations and
>> aggregations subclassing from AssignOperation and AggregateOperation. Each
>> operation would return the cost given two input vectors (based on their
>> cost properties)
>>
>>
>>
>> ------
>> Robin Anil
>>
>>
>>
>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <[email protected]> wrote:
>>
>>> Strike that. I think the code is very confusing. You will have better
>>> luck if you create method for the three implementations. It will make
>>> things explicit and easy to debug.
>>>
>>> public final double getIterationCost() {
>>> return one of {cardinality, 1.2*numNonDefault, numNonDefault}
>>> }
>>>
>>> public final double getLookupCost() {
>>> return one of {1, 1.5(lets say), log_2(numNonDefault)}
>>> }
>>>
>>> All your logic then becomes a minimization of the total cost.
>>>
>>> This is more extendable than current isDense(), isSequential(),
>>> isSparse(), isRandom() and doesn't assume what the underlying
>>> implementation is.
>>>
>>>
>>> ------
>>> Robin Anil
>>>
>>>
>>>
>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <[email protected]>wrote:
>>>
>>>> Also the calculation for cost is wrong. RASV is HashMap based its
>>>> cost O(n) not O(log n)
>>>>
>>>> ------
>>>> Robin Anil
>>>>
>>>>
>>>>
>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <[email protected]>wrote:
>>>>
>>>>> This is an automatically generated e-mail. To reply, visit:
>>>>> https://reviews.apache.org/r/10669/
>>>>>
>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56>
>>>>> (Diff
>>>>> revision 2)
>>>>>
>>>>> None
>>>>>
>>>>> 55
>>>>>
>>>>> boolean commutativeAndAssociative = aggregator.isCommutative() &&
>>>>> aggregator.isAssociative();
>>>>>
>>>>> This should be a method in AbstractFunction interface
>>>>>
>>>>>
>>>>>
>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037>
>>>>> (Diff
>>>>> revision 2)
>>>>>
>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726,
>>>>> 'expand_offset': 2}
>>>>>
>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927,
>>>>> 'expand_offset': 2}
>>>>>
>>>>> 942
>>>>>
>>>>> // Make all the non-zero values 0.
>>>>>
>>>>> vector.clear()
>>>>>
>>>>>
>>>>>
>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047>
>>>>> (Diff
>>>>> revision 2)
>>>>>
>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726,
>>>>> 'expand_offset': 2}
>>>>>
>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927,
>>>>> 'expand_offset': 2}
>>>>>
>>>>> 952
>>>>>
>>>>> while (it.hasNext()) {
>>>>>
>>>>> There is no guarantee of ordered iteration. The updates is unsorted.
>>>>> merge typically implies that input is sorted
>>>>>
>>>>>
>>>>>
>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074>
>>>>> (Diff
>>>>> revision 2)
>>>>>
>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726,
>>>>> 'expand_offset': 2}
>>>>>
>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927,
>>>>> 'expand_offset': 2}
>>>>>
>>>>> 979
>>>>>
>>>>> element.set(values[index]);
>>>>>
>>>>> What if there are values that are not in the existing data. I dont this
>>>>> is correct
>>>>>
>>>>>
>>>>>
>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100>
>>>>> (Diff
>>>>> revision 2)
>>>>>
>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726,
>>>>> 'expand_offset': 2}
>>>>>
>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927,
>>>>> 'expand_offset': 2}
>>>>>
>>>>> 999
>>>>>
>>>>> invalidateCachedLength();
>>>>>
>>>>> move invalidateCachedLength to the set method. Instead of dispersing it
>>>>> everywhere.
>>>>>
>>>>>
>>>>>
>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115>
>>>>> (Diff
>>>>> revision 2)
>>>>>
>>>>> None
>>>>>
>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element>
>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
>>>>>
>>>>> 1014
>>>>>
>>>>> protected Vector assignSkipZerosIterateOne(Iterator<Element>
>>>>> thisIterator, Vector other,
>>>>>
>>>>> be consistent, this that or thisIterator that Iterator.
>>>>>
>>>>>
>>>>>
>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171>
>>>>> (Diff
>>>>> revision 2)
>>>>>
>>>>> None
>>>>>
>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element>
>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
>>>>>
>>>>> 1070
>>>>>
>>>>> private Vector assignIterateBoth(Iterator<Element> thisIterator,
>>>>> Iterator<Element> thatIterator,
>>>>>
>>>>> This expects the data to to be sorted. indicate that in a comment.
>>>>> Similarly comment other places
>>>>>
>>>>>
>>>>>
>>>>> math/src/main/java/org/apache/mahout/math/function/Functions.java<https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900>
>>>>> (Diff
>>>>> revision 2)
>>>>>
>>>>> None
>>>>>
>>>>> {'text': ' public boolean isLikeLeftMult() {', 'line': 892}
>>>>>
>>>>> 896
>>>>>
>>>>> /**
>>>>>
>>>>> Move this comments to the base class
>>>>>
>>>>>
>>>>> - Robin
>>>>>
>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote:
>>>>> Review request for mahout, Ted Dunning, Sebastian Schelter, and
>>>>> Robin Anil.
>>>>> By Dan Filimon.
>>>>>
>>>>> *Updated April 22, 2013, 8:45 p.m.*
>>>>> Description
>>>>>
>>>>> This patch contains code cleaning up AbstractVector and making the
>>>>> operations as fast as possible while still having a high level interface.
>>>>>
>>>>> The main changes are in AbstractVector as well as new methods in
>>>>> DoubleDoubleFunction.
>>>>>
>>>>> Testing
>>>>>
>>>>> The vectors test pass but it's likely that the patch in it's current
>>>>> state is broken as other, unrelated tests (BallKMeans...) are failing.
>>>>> Also, my Hadoop conf is broken so I didn't run all the core tests. Anyone?
>>>>>
>>>>> I can't seem to find the bug, so _please_ have a closer look. It's still
>>>>> a work in progress.
>>>>>
>>>>> The benchmarks seem comparable (although there are some jarring
>>>>> diferences – Minkowski distance seems a lot slower in new-dan-1 than
>>>>> old-trunk-2). It may be however that this is just variance due to the
>>>>> load of the machine at the time. I'm having trouble interpreting the
>>>>> benchmarks in general, so anyone who could give me a hand is more than
>>>>> welcome.
>>>>>
>>>>> Diffs
>>>>>
>>>>> - math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
>>>>> (e12aa38)
>>>>> - math/src/main/java/org/apache/mahout/math/AbstractVector.java
>>>>> (090aa7a)
>>>>> - math/src/main/java/org/apache/mahout/math/Centroid.java (0c42196)
>>>>> - math/src/main/java/org/apache/mahout/math/ConstantVector.java
>>>>> (51d67d4)
>>>>> - math/src/main/java/org/apache/mahout/math/DelegatingVector.java
>>>>> (12220d4)
>>>>> - math/src/main/java/org/apache/mahout/math/DenseVector.java
>>>>> (41c356b)
>>>>> -
>>>>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
>>>>> (094003b)
>>>>> - math/src/main/java/org/apache/mahout/math/MatrixSlice.java
>>>>> (7f79c96)
>>>>> - math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
>>>>> (af70727)
>>>>> - math/src/main/java/org/apache/mahout/math/NamedVector.java
>>>>> (4b7a41d)
>>>>> -
>>>>> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
>>>>> (650d82d)
>>>>> - math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
>>>>> (d1ea93a)
>>>>> -
>>>>> math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
>>>>> (6f85692)
>>>>> -
>>>>> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
>>>>> (21982f9)
>>>>> - math/src/main/java/org/apache/mahout/math/Vector.java (2f8b417)
>>>>> - math/src/main/java/org/apache/mahout/math/VectorView.java
>>>>> (add2a60)
>>>>> - math/src/main/java/org/apache/mahout/math/WeightedVector.java
>>>>> (06fbd05)
>>>>> -
>>>>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
>>>>> (82b350a)
>>>>> - math/src/main/java/org/apache/mahout/math/function/Functions.java
>>>>> (eb2e42f)
>>>>> - math/src/main/java/org/apache/mahout/math/function/PlusMult.java
>>>>> (60587b1)
>>>>> - math/src/main/java/org/apache/mahout/math/function/TimesFunction.java
>>>>> (8ab0bb1)
>>>>> - math/src/main/java/org/apache/mahout/math/jet/math/Constants.java
>>>>> (53535d6)
>>>>> - math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java
>>>>> (2b11199)
>>>>> - math/src/test/java/org/apache/mahout/math/VectorTest.java
>>>>> (d6d554b)
>>>>>
>>>>> View Diff <https://reviews.apache.org/r/10669/diff/>
>>>>>
>>>>
>>>>
>>>
>>
>