Depends on the speed of copying a 1M double array v/s doing a calloc + copying 1000 non zeros (Assuming java is doing that underneath).
------ Robin Anil On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <[email protected]>wrote: > Right, but is clone() generally slower than assigning? That strikes me as > odd; doesn't Java optimize copying the internal structures (there are > arrays underneath after all)? > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <[email protected]> wrote: > >> Seems like for dense clone is slower than like().assign I need to test it >> with different sizes to be sure. I kept it from the old behavior. >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <[email protected]> >> wrote: >> >>> Okay, so I should split it further into smaller sub-cases that handle >>> each Vector type. I tried making so that these match the cases in the >>> document to the extent possible. >>> You're right. It is ugly and I need to split it up. >>> >>> I removed the OrderedIntDoubleMapping (but with another if...) but one >>> more thing that helped is making the assign() only go through the non-zeros >>> when setting up a new copy. >>> >>> About the copying, I saw that the optimized copy doesn't always call >>> clone clone on the Vector. Why is this? >>> >>> Thanks for the test and the feedback! :) >>> >>> >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <[email protected]>wrote: >>> >>>> Yes its reaching dev. I checked out your code to take a look. I >>>> understand you are trying to still mix isSequentialAccess etc in the >>>> validity calculation. The whole point of this architecture is to completely >>>> unroll things to be more modular instead of creating a giant spaghetti >>>> >>>> Here is an example of what I am trying to say. >>>> >>>> This is your current implementation of AssignIterateOneLookupOther. >>>> There are so many conditionals >>>> >>>> >>>> 1. @Override >>>> 2. public Vector assign(Vector x, Vector y, >>>> DoubleDoubleFunction f) { >>>> 3. return !swap ? assignInner(x, y, f) : assignInner(y, x, f); >>>> 4. } >>>> 5. >>>> 6. public Vector assignInner(Vector x, Vector y, >>>> DoubleDoubleFunction f) { >>>> 7. Iterator<Vector.Element> xi = x.iterateNonZero(); >>>> 8. Vector.Element xe; >>>> 9. Vector.Element ye; >>>> 10. OrderedIntDoubleMapping updates = >>>> newOrderedIntDoubleMapping(); >>>> 11. while (xi.hasNext()) { >>>> 12. xe = xi.next(); >>>> 13. ye = y.getElement(xe.index()); >>>> 14. if (!swap) { >>>> 15. xe.set(f.apply(xe.get(), ye.get())); >>>> 16. } else { >>>> 17. if (ye.get() != 0.0 || y.isAddConstantTime()) { >>>> 18. ye.set(f.apply(ye.get(), xe.get())); >>>> 19. } else { >>>> 20. updates.set(xe.index(), f.apply(ye.get(), >>>> xe.get())); >>>> 21. } >>>> 22. } >>>> 23. } >>>> 24. if (swap && !y.isAddConstantTime()) { >>>> 25. y.mergeUpdates(updates); >>>> 26. } >>>> 27. return swap ? y : x; >>>> 28. } >>>> 29. } >>>> >>>> Split this into. >>>> >>>> IterateThisLookupThatInplaceUpdate >>>> IterateThatLookupThisInplaceUpdate >>>> IterateThisLookupThatMergeUpdate >>>> IterateThatLookupThisMergeUpdate >>>> >>>> Removing the conditionals itself will speedup the operations you see >>>> regression on. >>>> >>>> Also note that getElement() is not constant Add time for RASV. In >>>> earlier version we tried at best to set in place. That reduced the extra >>>> hashkey computation. getElement in AbstractVector is really optimized for >>>> DenseVector and it does indicate as the cause of regression you are seeing >>>> in RASV. Another one is the unused new OrderedIntDoubleMapping(). Removing >>>> all these will bring back the loss. >>>> >>>> Now the cost for each can be computed as: cost of iteration + cost of >>>> lookup + cost of update (if its not in-place) >>>> >>>> See the attached file for a mock test. You will have to rework the >>>> expectations but the framework should be understandable. >>>> >>>> >>>> ------ >>>> Robin Anil >>>> >>>> >>>> >>>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon < >>>> [email protected]> wrote: >>>> >>>>> Here's [1] a link to the "design document" of the new vector >>>>> operations so I can lay out the ideas behind what I'm doing more clearly. >>>>> >>>>> I'd like anyone who can to have a look and comment. >>>>> >>>>> Ted, >>>>> I know you'd like me to get back to working on clustering, but >>>>> currently, BallKMeans is roughly 2x as slow as Mahout's KMeans. >>>>> This is because of centroid.update() for one (it doesn't have >>>>> "interesting" properties that are exploited in the current) and also >>>>> because we're using we're just doing lots of Vector operations: dot >>>>> products for projections, hash codes for unique sets of candidates, >>>>> equality checks etc. >>>>> >>>>> Here's a snapshot of the hotspots for one run of BallKMeans (20 >>>>> newsgroups, unprojected, i.e 20K sparse vectors with 200K dimensions and >>>>> ~100 nonzero values): >>>>> Name Time (ms) org.apache.mahout.math.AbstractVector.dot(Vector) >>>>> 104069 org.apache.mahout.math.DelegatingVector.hashCode() 30159 >>>>> org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator, >>>>> Iterator, DoubleDoubleFunction) 13720 >>>>> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping) >>>>> 2911 java.lang.ClassLoader.loadClass(String, boolean) 2620 >>>>> sun.misc.Launcher$AppClassLoader.loadClass(String, >>>>> boolean) 2620 >>>>> >>>>> >>>>> You can see that the most time is spent doing Vector operations. >>>>> Combined, the first three rows account for 91% of the runtime. >>>>> >>>>> Thanks for your help! >>>>> >>>>> PS: Is this reaching dev@? >>>>> >>>>> [1] >>>>> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit# >>>>> >>>>> >>>>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <[email protected]>wrote: >>>>> >>>>>> Few more brain dump ideas >>>>>> >>>>>> For any operation (assign, aggregate) you are trying to do. >>>>>> Compute set of "feasible" solutions: Based on all the properties of >>>>>> functions >>>>>> Choose the solution that minimises the cost. >>>>>> >>>>>> For this you may need to refactor all the special kinds of iterations >>>>>> and aggregations subclassing from AssignOperation and AggregateOperation. >>>>>> Each operation would return the cost given two input vectors (based on >>>>>> their cost properties) >>>>>> >>>>>> >>>>>> >>>>>> ------ >>>>>> Robin Anil >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <[email protected]>wrote: >>>>>> >>>>>>> Strike that. I think the code is very confusing. You will have >>>>>>> better luck if you create method for the three implementations. It will >>>>>>> make things explicit and easy to debug. >>>>>>> >>>>>>> public final double getIterationCost() { >>>>>>> return one of {cardinality, 1.2*numNonDefault, numNonDefault} >>>>>>> } >>>>>>> >>>>>>> public final double getLookupCost() { >>>>>>> return one of {1, 1.5(lets say), log_2(numNonDefault)} >>>>>>> } >>>>>>> >>>>>>> All your logic then becomes a minimization of the total cost. >>>>>>> >>>>>>> This is more extendable than current isDense(), isSequential(), >>>>>>> isSparse(), isRandom() and doesn't assume what the underlying >>>>>>> implementation is. >>>>>>> >>>>>>> >>>>>>> ------ >>>>>>> Robin Anil >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <[email protected]>wrote: >>>>>>> >>>>>>>> Also the calculation for cost is wrong. RASV is HashMap based its >>>>>>>> cost O(n) not O(log n) >>>>>>>> >>>>>>>> ------ >>>>>>>> Robin Anil >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil >>>>>>>> <[email protected]>wrote: >>>>>>>> >>>>>>>>> This is an automatically generated e-mail. To reply, visit: >>>>>>>>> https://reviews.apache.org/r/10669/ >>>>>>>>> >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56> >>>>>>>>> (Diff >>>>>>>>> revision 2) >>>>>>>>> >>>>>>>>> None >>>>>>>>> >>>>>>>>> 55 >>>>>>>>> >>>>>>>>> boolean commutativeAndAssociative = aggregator.isCommutative() && >>>>>>>>> aggregator.isAssociative(); >>>>>>>>> >>>>>>>>> This should be a method in AbstractFunction interface >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037> >>>>>>>>> (Diff >>>>>>>>> revision 2) >>>>>>>>> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>>>> 'expand_offset': 2} >>>>>>>>> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>>>> 'expand_offset': 2} >>>>>>>>> >>>>>>>>> 942 >>>>>>>>> >>>>>>>>> // Make all the non-zero values 0. >>>>>>>>> >>>>>>>>> vector.clear() >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047> >>>>>>>>> (Diff >>>>>>>>> revision 2) >>>>>>>>> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>>>> 'expand_offset': 2} >>>>>>>>> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>>>> 'expand_offset': 2} >>>>>>>>> >>>>>>>>> 952 >>>>>>>>> >>>>>>>>> while (it.hasNext()) { >>>>>>>>> >>>>>>>>> There is no guarantee of ordered iteration. The updates is >>>>>>>>> unsorted. merge typically implies that input is sorted >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074> >>>>>>>>> (Diff >>>>>>>>> revision 2) >>>>>>>>> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>>>> 'expand_offset': 2} >>>>>>>>> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>>>> 'expand_offset': 2} >>>>>>>>> >>>>>>>>> 979 >>>>>>>>> >>>>>>>>> element.set(values[index]); >>>>>>>>> >>>>>>>>> What if there are values that are not in the existing data. I dont >>>>>>>>> this is correct >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100> >>>>>>>>> (Diff >>>>>>>>> revision 2) >>>>>>>>> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>>>> 'expand_offset': 2} >>>>>>>>> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>>>> 'expand_offset': 2} >>>>>>>>> >>>>>>>>> 999 >>>>>>>>> >>>>>>>>> invalidateCachedLength(); >>>>>>>>> >>>>>>>>> move invalidateCachedLength to the set method. Instead of >>>>>>>>> dispersing it everywhere. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115> >>>>>>>>> (Diff >>>>>>>>> revision 2) >>>>>>>>> >>>>>>>>> None >>>>>>>>> >>>>>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element> >>>>>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070} >>>>>>>>> >>>>>>>>> 1014 >>>>>>>>> >>>>>>>>> protected Vector assignSkipZerosIterateOne(Iterator<Element> >>>>>>>>> thisIterator, Vector other, >>>>>>>>> >>>>>>>>> be consistent, this that or thisIterator that Iterator. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171> >>>>>>>>> (Diff >>>>>>>>> revision 2) >>>>>>>>> >>>>>>>>> None >>>>>>>>> >>>>>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element> >>>>>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070} >>>>>>>>> >>>>>>>>> 1070 >>>>>>>>> >>>>>>>>> private Vector assignIterateBoth(Iterator<Element> thisIterator, >>>>>>>>> Iterator<Element> thatIterator, >>>>>>>>> >>>>>>>>> This expects the data to to be sorted. indicate that in a comment. >>>>>>>>> Similarly comment other places >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> math/src/main/java/org/apache/mahout/math/function/Functions.java<https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900> >>>>>>>>> (Diff >>>>>>>>> revision 2) >>>>>>>>> >>>>>>>>> None >>>>>>>>> >>>>>>>>> {'text': ' public boolean isLikeLeftMult() {', 'line': 892} >>>>>>>>> >>>>>>>>> 896 >>>>>>>>> >>>>>>>>> /** >>>>>>>>> >>>>>>>>> Move this comments to the base class >>>>>>>>> >>>>>>>>> >>>>>>>>> - Robin >>>>>>>>> >>>>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote: >>>>>>>>> Review request for mahout, Ted Dunning, Sebastian Schelter, and >>>>>>>>> Robin Anil. >>>>>>>>> By Dan Filimon. >>>>>>>>> >>>>>>>>> *Updated April 22, 2013, 8:45 p.m.* >>>>>>>>> Description >>>>>>>>> >>>>>>>>> This patch contains code cleaning up AbstractVector and making the >>>>>>>>> operations as fast as possible while still having a high level >>>>>>>>> interface. >>>>>>>>> >>>>>>>>> The main changes are in AbstractVector as well as new methods in >>>>>>>>> DoubleDoubleFunction. >>>>>>>>> >>>>>>>>> Testing >>>>>>>>> >>>>>>>>> The vectors test pass but it's likely that the patch in it's current >>>>>>>>> state is broken as other, unrelated tests (BallKMeans...) are failing. >>>>>>>>> Also, my Hadoop conf is broken so I didn't run all the core tests. >>>>>>>>> Anyone? >>>>>>>>> >>>>>>>>> I can't seem to find the bug, so _please_ have a closer look. It's >>>>>>>>> still a work in progress. >>>>>>>>> >>>>>>>>> The benchmarks seem comparable (although there are some jarring >>>>>>>>> diferences – Minkowski distance seems a lot slower in new-dan-1 than >>>>>>>>> old-trunk-2). It may be however that this is just variance due to the >>>>>>>>> load of the machine at the time. I'm having trouble interpreting the >>>>>>>>> benchmarks in general, so anyone who could give me a hand is more >>>>>>>>> than welcome. >>>>>>>>> >>>>>>>>> Diffs >>>>>>>>> >>>>>>>>> - math/src/main/java/org/apache/mahout/math/AbstractMatrix.java >>>>>>>>> (e12aa38) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/AbstractVector.java >>>>>>>>> (090aa7a) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/Centroid.java >>>>>>>>> (0c42196) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/ConstantVector.java >>>>>>>>> (51d67d4) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/DelegatingVector.java >>>>>>>>> (12220d4) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/DenseVector.java >>>>>>>>> (41c356b) >>>>>>>>> - >>>>>>>>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java >>>>>>>>> (094003b) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/MatrixSlice.java >>>>>>>>> (7f79c96) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/MatrixVectorView.java >>>>>>>>> (af70727) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/NamedVector.java >>>>>>>>> (4b7a41d) >>>>>>>>> - >>>>>>>>> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java >>>>>>>>> (650d82d) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/PermutedVectorView.java >>>>>>>>> (d1ea93a) >>>>>>>>> - >>>>>>>>> math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java >>>>>>>>> (6f85692) >>>>>>>>> - >>>>>>>>> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java >>>>>>>>> (21982f9) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/Vector.java >>>>>>>>> (2f8b417) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/VectorView.java >>>>>>>>> (add2a60) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/WeightedVector.java >>>>>>>>> (06fbd05) >>>>>>>>> - >>>>>>>>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java >>>>>>>>> (82b350a) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/function/Functions.java >>>>>>>>> (eb2e42f) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/function/PlusMult.java >>>>>>>>> (60587b1) >>>>>>>>> - >>>>>>>>> math/src/main/java/org/apache/mahout/math/function/TimesFunction.java >>>>>>>>> (8ab0bb1) >>>>>>>>> - math/src/main/java/org/apache/mahout/math/jet/math/Constants.java >>>>>>>>> (53535d6) >>>>>>>>> - math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java >>>>>>>>> (2b11199) >>>>>>>>> - math/src/test/java/org/apache/mahout/math/VectorTest.java >>>>>>>>> (d6d554b) >>>>>>>>> >>>>>>>>> View Diff <https://reviews.apache.org/r/10669/diff/> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >
