Seems like for dense clone is slower than like().assign I need to test it with different sizes to be sure. I kept it from the old behavior. On Apr 25, 2013 2:12 AM, "Dan Filimon" <[email protected]> wrote:
> Okay, so I should split it further into smaller sub-cases that handle each > Vector type. I tried making so that these match the cases in the document > to the extent possible. > You're right. It is ugly and I need to split it up. > > I removed the OrderedIntDoubleMapping (but with another if...) but one > more thing that helped is making the assign() only go through the non-zeros > when setting up a new copy. > > About the copying, I saw that the optimized copy doesn't always call clone > clone on the Vector. Why is this? > > Thanks for the test and the feedback! :) > > > On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <[email protected]> wrote: > >> Yes its reaching dev. I checked out your code to take a look. I >> understand you are trying to still mix isSequentialAccess etc in the >> validity calculation. The whole point of this architecture is to completely >> unroll things to be more modular instead of creating a giant spaghetti >> >> Here is an example of what I am trying to say. >> >> This is your current implementation of AssignIterateOneLookupOther. There >> are so many conditionals >> >> >> 1. @Override >> 2. public Vector assign(Vector x, Vector y, DoubleDoubleFunction >> f) { >> 3. return !swap ? assignInner(x, y, f) : assignInner(y, x, f); >> 4. } >> 5. >> 6. public Vector assignInner(Vector x, Vector y, >> DoubleDoubleFunction f) { >> 7. Iterator<Vector.Element> xi = x.iterateNonZero(); >> 8. Vector.Element xe; >> 9. Vector.Element ye; >> 10. OrderedIntDoubleMapping updates = newOrderedIntDoubleMapping(); >> 11. while (xi.hasNext()) { >> 12. xe = xi.next(); >> 13. ye = y.getElement(xe.index()); >> 14. if (!swap) { >> 15. xe.set(f.apply(xe.get(), ye.get())); >> 16. } else { >> 17. if (ye.get() != 0.0 || y.isAddConstantTime()) { >> 18. ye.set(f.apply(ye.get(), xe.get())); >> 19. } else { >> 20. updates.set(xe.index(), f.apply(ye.get(), xe.get())); >> 21. } >> 22. } >> 23. } >> 24. if (swap && !y.isAddConstantTime()) { >> 25. y.mergeUpdates(updates); >> 26. } >> 27. return swap ? y : x; >> 28. } >> 29. } >> >> Split this into. >> >> IterateThisLookupThatInplaceUpdate >> IterateThatLookupThisInplaceUpdate >> IterateThisLookupThatMergeUpdate >> IterateThatLookupThisMergeUpdate >> >> Removing the conditionals itself will speedup the operations you see >> regression on. >> >> Also note that getElement() is not constant Add time for RASV. In earlier >> version we tried at best to set in place. That reduced the extra hashkey >> computation. getElement in AbstractVector is really optimized for >> DenseVector and it does indicate as the cause of regression you are seeing >> in RASV. Another one is the unused new OrderedIntDoubleMapping(). Removing >> all these will bring back the loss. >> >> Now the cost for each can be computed as: cost of iteration + cost of >> lookup + cost of update (if its not in-place) >> >> See the attached file for a mock test. You will have to rework the >> expectations but the framework should be understandable. >> >> >> ------ >> Robin Anil >> >> >> >> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon <[email protected] >> > wrote: >> >>> Here's [1] a link to the "design document" of the new vector operations >>> so I can lay out the ideas behind what I'm doing more clearly. >>> >>> I'd like anyone who can to have a look and comment. >>> >>> Ted, >>> I know you'd like me to get back to working on clustering, but >>> currently, BallKMeans is roughly 2x as slow as Mahout's KMeans. >>> This is because of centroid.update() for one (it doesn't have >>> "interesting" properties that are exploited in the current) and also >>> because we're using we're just doing lots of Vector operations: dot >>> products for projections, hash codes for unique sets of candidates, >>> equality checks etc. >>> >>> Here's a snapshot of the hotspots for one run of BallKMeans (20 >>> newsgroups, unprojected, i.e 20K sparse vectors with 200K dimensions and >>> ~100 nonzero values): >>> Name Time (ms) org.apache.mahout.math.AbstractVector.dot(Vector) >>> 104069 org.apache.mahout.math.DelegatingVector.hashCode() 30159 >>> org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator, >>> Iterator, DoubleDoubleFunction) 13720 >>> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping) >>> 2911 java.lang.ClassLoader.loadClass(String, boolean) 2620 >>> sun.misc.Launcher$AppClassLoader.loadClass(String, >>> boolean) 2620 >>> >>> >>> You can see that the most time is spent doing Vector operations. >>> Combined, the first three rows account for 91% of the runtime. >>> >>> Thanks for your help! >>> >>> PS: Is this reaching dev@? >>> >>> [1] >>> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit# >>> >>> >>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <[email protected]>wrote: >>> >>>> Few more brain dump ideas >>>> >>>> For any operation (assign, aggregate) you are trying to do. >>>> Compute set of "feasible" solutions: Based on all the properties of >>>> functions >>>> Choose the solution that minimises the cost. >>>> >>>> For this you may need to refactor all the special kinds of iterations >>>> and aggregations subclassing from AssignOperation and AggregateOperation. >>>> Each operation would return the cost given two input vectors (based on >>>> their cost properties) >>>> >>>> >>>> >>>> ------ >>>> Robin Anil >>>> >>>> >>>> >>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <[email protected]>wrote: >>>> >>>>> Strike that. I think the code is very confusing. You will have better >>>>> luck if you create method for the three implementations. It will make >>>>> things explicit and easy to debug. >>>>> >>>>> public final double getIterationCost() { >>>>> return one of {cardinality, 1.2*numNonDefault, numNonDefault} >>>>> } >>>>> >>>>> public final double getLookupCost() { >>>>> return one of {1, 1.5(lets say), log_2(numNonDefault)} >>>>> } >>>>> >>>>> All your logic then becomes a minimization of the total cost. >>>>> >>>>> This is more extendable than current isDense(), isSequential(), >>>>> isSparse(), isRandom() and doesn't assume what the underlying >>>>> implementation is. >>>>> >>>>> >>>>> ------ >>>>> Robin Anil >>>>> >>>>> >>>>> >>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <[email protected]>wrote: >>>>> >>>>>> Also the calculation for cost is wrong. RASV is HashMap based its >>>>>> cost O(n) not O(log n) >>>>>> >>>>>> ------ >>>>>> Robin Anil >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <[email protected]>wrote: >>>>>> >>>>>>> This is an automatically generated e-mail. To reply, visit: >>>>>>> https://reviews.apache.org/r/10669/ >>>>>>> >>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56> >>>>>>> (Diff >>>>>>> revision 2) >>>>>>> >>>>>>> None >>>>>>> >>>>>>> 55 >>>>>>> >>>>>>> boolean commutativeAndAssociative = aggregator.isCommutative() && >>>>>>> aggregator.isAssociative(); >>>>>>> >>>>>>> This should be a method in AbstractFunction interface >>>>>>> >>>>>>> >>>>>>> >>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037> >>>>>>> (Diff >>>>>>> revision 2) >>>>>>> >>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>> 'expand_offset': 2} >>>>>>> >>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>> 'expand_offset': 2} >>>>>>> >>>>>>> 942 >>>>>>> >>>>>>> // Make all the non-zero values 0. >>>>>>> >>>>>>> vector.clear() >>>>>>> >>>>>>> >>>>>>> >>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047> >>>>>>> (Diff >>>>>>> revision 2) >>>>>>> >>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>> 'expand_offset': 2} >>>>>>> >>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>> 'expand_offset': 2} >>>>>>> >>>>>>> 952 >>>>>>> >>>>>>> while (it.hasNext()) { >>>>>>> >>>>>>> There is no guarantee of ordered iteration. The updates is unsorted. >>>>>>> merge typically implies that input is sorted >>>>>>> >>>>>>> >>>>>>> >>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074> >>>>>>> (Diff >>>>>>> revision 2) >>>>>>> >>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>> 'expand_offset': 2} >>>>>>> >>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>> 'expand_offset': 2} >>>>>>> >>>>>>> 979 >>>>>>> >>>>>>> element.set(values[index]); >>>>>>> >>>>>>> What if there are values that are not in the existing data. I dont >>>>>>> this is correct >>>>>>> >>>>>>> >>>>>>> >>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100> >>>>>>> (Diff >>>>>>> revision 2) >>>>>>> >>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>> 'expand_offset': 2} >>>>>>> >>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>> 'expand_offset': 2} >>>>>>> >>>>>>> 999 >>>>>>> >>>>>>> invalidateCachedLength(); >>>>>>> >>>>>>> move invalidateCachedLength to the set method. Instead of dispersing >>>>>>> it everywhere. >>>>>>> >>>>>>> >>>>>>> >>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115> >>>>>>> (Diff >>>>>>> revision 2) >>>>>>> >>>>>>> None >>>>>>> >>>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element> >>>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070} >>>>>>> >>>>>>> 1014 >>>>>>> >>>>>>> protected Vector assignSkipZerosIterateOne(Iterator<Element> >>>>>>> thisIterator, Vector other, >>>>>>> >>>>>>> be consistent, this that or thisIterator that Iterator. >>>>>>> >>>>>>> >>>>>>> >>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171> >>>>>>> (Diff >>>>>>> revision 2) >>>>>>> >>>>>>> None >>>>>>> >>>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element> >>>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070} >>>>>>> >>>>>>> 1070 >>>>>>> >>>>>>> private Vector assignIterateBoth(Iterator<Element> thisIterator, >>>>>>> Iterator<Element> thatIterator, >>>>>>> >>>>>>> This expects the data to to be sorted. indicate that in a comment. >>>>>>> Similarly comment other places >>>>>>> >>>>>>> >>>>>>> >>>>>>> math/src/main/java/org/apache/mahout/math/function/Functions.java<https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900> >>>>>>> (Diff >>>>>>> revision 2) >>>>>>> >>>>>>> None >>>>>>> >>>>>>> {'text': ' public boolean isLikeLeftMult() {', 'line': 892} >>>>>>> >>>>>>> 896 >>>>>>> >>>>>>> /** >>>>>>> >>>>>>> Move this comments to the base class >>>>>>> >>>>>>> >>>>>>> - Robin >>>>>>> >>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote: >>>>>>> Review request for mahout, Ted Dunning, Sebastian Schelter, and >>>>>>> Robin Anil. >>>>>>> By Dan Filimon. >>>>>>> >>>>>>> *Updated April 22, 2013, 8:45 p.m.* >>>>>>> Description >>>>>>> >>>>>>> This patch contains code cleaning up AbstractVector and making the >>>>>>> operations as fast as possible while still having a high level >>>>>>> interface. >>>>>>> >>>>>>> The main changes are in AbstractVector as well as new methods in >>>>>>> DoubleDoubleFunction. >>>>>>> >>>>>>> Testing >>>>>>> >>>>>>> The vectors test pass but it's likely that the patch in it's current >>>>>>> state is broken as other, unrelated tests (BallKMeans...) are failing. >>>>>>> Also, my Hadoop conf is broken so I didn't run all the core tests. >>>>>>> Anyone? >>>>>>> >>>>>>> I can't seem to find the bug, so _please_ have a closer look. It's >>>>>>> still a work in progress. >>>>>>> >>>>>>> The benchmarks seem comparable (although there are some jarring >>>>>>> diferences – Minkowski distance seems a lot slower in new-dan-1 than >>>>>>> old-trunk-2). It may be however that this is just variance due to the >>>>>>> load of the machine at the time. I'm having trouble interpreting the >>>>>>> benchmarks in general, so anyone who could give me a hand is more than >>>>>>> welcome. >>>>>>> >>>>>>> Diffs >>>>>>> >>>>>>> - math/src/main/java/org/apache/mahout/math/AbstractMatrix.java >>>>>>> (e12aa38) >>>>>>> - math/src/main/java/org/apache/mahout/math/AbstractVector.java >>>>>>> (090aa7a) >>>>>>> - math/src/main/java/org/apache/mahout/math/Centroid.java >>>>>>> (0c42196) >>>>>>> - math/src/main/java/org/apache/mahout/math/ConstantVector.java >>>>>>> (51d67d4) >>>>>>> - math/src/main/java/org/apache/mahout/math/DelegatingVector.java >>>>>>> (12220d4) >>>>>>> - math/src/main/java/org/apache/mahout/math/DenseVector.java >>>>>>> (41c356b) >>>>>>> - >>>>>>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java >>>>>>> (094003b) >>>>>>> - math/src/main/java/org/apache/mahout/math/MatrixSlice.java >>>>>>> (7f79c96) >>>>>>> - math/src/main/java/org/apache/mahout/math/MatrixVectorView.java >>>>>>> (af70727) >>>>>>> - math/src/main/java/org/apache/mahout/math/NamedVector.java >>>>>>> (4b7a41d) >>>>>>> - >>>>>>> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java >>>>>>> (650d82d) >>>>>>> - math/src/main/java/org/apache/mahout/math/PermutedVectorView.java >>>>>>> (d1ea93a) >>>>>>> - >>>>>>> math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java >>>>>>> (6f85692) >>>>>>> - >>>>>>> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java >>>>>>> (21982f9) >>>>>>> - math/src/main/java/org/apache/mahout/math/Vector.java (2f8b417) >>>>>>> - math/src/main/java/org/apache/mahout/math/VectorView.java >>>>>>> (add2a60) >>>>>>> - math/src/main/java/org/apache/mahout/math/WeightedVector.java >>>>>>> (06fbd05) >>>>>>> - >>>>>>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java >>>>>>> (82b350a) >>>>>>> - math/src/main/java/org/apache/mahout/math/function/Functions.java >>>>>>> (eb2e42f) >>>>>>> - math/src/main/java/org/apache/mahout/math/function/PlusMult.java >>>>>>> (60587b1) >>>>>>> - >>>>>>> math/src/main/java/org/apache/mahout/math/function/TimesFunction.java >>>>>>> (8ab0bb1) >>>>>>> - math/src/main/java/org/apache/mahout/math/jet/math/Constants.java >>>>>>> (53535d6) >>>>>>> - math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java >>>>>>> (2b11199) >>>>>>> - math/src/test/java/org/apache/mahout/math/VectorTest.java >>>>>>> (d6d554b) >>>>>>> >>>>>>> View Diff <https://reviews.apache.org/r/10669/diff/> >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >
