Right, but is clone() generally slower than assigning? That strikes me as odd; doesn't Java optimize copying the internal structures (there are arrays underneath after all)?
On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <[email protected]> wrote: > Seems like for dense clone is slower than like().assign I need to test it > with different sizes to be sure. I kept it from the old behavior. > On Apr 25, 2013 2:12 AM, "Dan Filimon" <[email protected]> > wrote: > >> Okay, so I should split it further into smaller sub-cases that handle >> each Vector type. I tried making so that these match the cases in the >> document to the extent possible. >> You're right. It is ugly and I need to split it up. >> >> I removed the OrderedIntDoubleMapping (but with another if...) but one >> more thing that helped is making the assign() only go through the non-zeros >> when setting up a new copy. >> >> About the copying, I saw that the optimized copy doesn't always call >> clone clone on the Vector. Why is this? >> >> Thanks for the test and the feedback! :) >> >> >> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <[email protected]> wrote: >> >>> Yes its reaching dev. I checked out your code to take a look. I >>> understand you are trying to still mix isSequentialAccess etc in the >>> validity calculation. The whole point of this architecture is to completely >>> unroll things to be more modular instead of creating a giant spaghetti >>> >>> Here is an example of what I am trying to say. >>> >>> This is your current implementation of AssignIterateOneLookupOther. >>> There are so many conditionals >>> >>> >>> 1. @Override >>> 2. public Vector assign(Vector x, Vector y, DoubleDoubleFunction >>> f) { >>> 3. return !swap ? assignInner(x, y, f) : assignInner(y, x, f); >>> 4. } >>> 5. >>> 6. public Vector assignInner(Vector x, Vector y, >>> DoubleDoubleFunction f) { >>> 7. Iterator<Vector.Element> xi = x.iterateNonZero(); >>> 8. Vector.Element xe; >>> 9. Vector.Element ye; >>> 10. OrderedIntDoubleMapping updates = newOrderedIntDoubleMapping(); >>> 11. while (xi.hasNext()) { >>> 12. xe = xi.next(); >>> 13. ye = y.getElement(xe.index()); >>> 14. if (!swap) { >>> 15. xe.set(f.apply(xe.get(), ye.get())); >>> 16. } else { >>> 17. if (ye.get() != 0.0 || y.isAddConstantTime()) { >>> 18. ye.set(f.apply(ye.get(), xe.get())); >>> 19. } else { >>> 20. updates.set(xe.index(), f.apply(ye.get(), xe.get())); >>> 21. } >>> 22. } >>> 23. } >>> 24. if (swap && !y.isAddConstantTime()) { >>> 25. y.mergeUpdates(updates); >>> 26. } >>> 27. return swap ? y : x; >>> 28. } >>> 29. } >>> >>> Split this into. >>> >>> IterateThisLookupThatInplaceUpdate >>> IterateThatLookupThisInplaceUpdate >>> IterateThisLookupThatMergeUpdate >>> IterateThatLookupThisMergeUpdate >>> >>> Removing the conditionals itself will speedup the operations you see >>> regression on. >>> >>> Also note that getElement() is not constant Add time for RASV. In >>> earlier version we tried at best to set in place. That reduced the extra >>> hashkey computation. getElement in AbstractVector is really optimized for >>> DenseVector and it does indicate as the cause of regression you are seeing >>> in RASV. Another one is the unused new OrderedIntDoubleMapping(). Removing >>> all these will bring back the loss. >>> >>> Now the cost for each can be computed as: cost of iteration + cost of >>> lookup + cost of update (if its not in-place) >>> >>> See the attached file for a mock test. You will have to rework the >>> expectations but the framework should be understandable. >>> >>> >>> ------ >>> Robin Anil >>> >>> >>> >>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon < >>> [email protected]> wrote: >>> >>>> Here's [1] a link to the "design document" of the new vector operations >>>> so I can lay out the ideas behind what I'm doing more clearly. >>>> >>>> I'd like anyone who can to have a look and comment. >>>> >>>> Ted, >>>> I know you'd like me to get back to working on clustering, but >>>> currently, BallKMeans is roughly 2x as slow as Mahout's KMeans. >>>> This is because of centroid.update() for one (it doesn't have >>>> "interesting" properties that are exploited in the current) and also >>>> because we're using we're just doing lots of Vector operations: dot >>>> products for projections, hash codes for unique sets of candidates, >>>> equality checks etc. >>>> >>>> Here's a snapshot of the hotspots for one run of BallKMeans (20 >>>> newsgroups, unprojected, i.e 20K sparse vectors with 200K dimensions and >>>> ~100 nonzero values): >>>> Name Time (ms) org.apache.mahout.math.AbstractVector.dot(Vector) >>>> 104069 org.apache.mahout.math.DelegatingVector.hashCode() 30159 >>>> org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator, >>>> Iterator, DoubleDoubleFunction) 13720 >>>> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping) >>>> 2911 java.lang.ClassLoader.loadClass(String, boolean) 2620 >>>> sun.misc.Launcher$AppClassLoader.loadClass(String, >>>> boolean) 2620 >>>> >>>> >>>> You can see that the most time is spent doing Vector operations. >>>> Combined, the first three rows account for 91% of the runtime. >>>> >>>> Thanks for your help! >>>> >>>> PS: Is this reaching dev@? >>>> >>>> [1] >>>> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit# >>>> >>>> >>>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <[email protected]>wrote: >>>> >>>>> Few more brain dump ideas >>>>> >>>>> For any operation (assign, aggregate) you are trying to do. >>>>> Compute set of "feasible" solutions: Based on all the properties of >>>>> functions >>>>> Choose the solution that minimises the cost. >>>>> >>>>> For this you may need to refactor all the special kinds of iterations >>>>> and aggregations subclassing from AssignOperation and AggregateOperation. >>>>> Each operation would return the cost given two input vectors (based on >>>>> their cost properties) >>>>> >>>>> >>>>> >>>>> ------ >>>>> Robin Anil >>>>> >>>>> >>>>> >>>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <[email protected]>wrote: >>>>> >>>>>> Strike that. I think the code is very confusing. You will have better >>>>>> luck if you create method for the three implementations. It will make >>>>>> things explicit and easy to debug. >>>>>> >>>>>> public final double getIterationCost() { >>>>>> return one of {cardinality, 1.2*numNonDefault, numNonDefault} >>>>>> } >>>>>> >>>>>> public final double getLookupCost() { >>>>>> return one of {1, 1.5(lets say), log_2(numNonDefault)} >>>>>> } >>>>>> >>>>>> All your logic then becomes a minimization of the total cost. >>>>>> >>>>>> This is more extendable than current isDense(), isSequential(), >>>>>> isSparse(), isRandom() and doesn't assume what the underlying >>>>>> implementation is. >>>>>> >>>>>> >>>>>> ------ >>>>>> Robin Anil >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <[email protected]>wrote: >>>>>> >>>>>>> Also the calculation for cost is wrong. RASV is HashMap based its >>>>>>> cost O(n) not O(log n) >>>>>>> >>>>>>> ------ >>>>>>> Robin Anil >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <[email protected]>wrote: >>>>>>> >>>>>>>> This is an automatically generated e-mail. To reply, visit: >>>>>>>> https://reviews.apache.org/r/10669/ >>>>>>>> >>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56> >>>>>>>> (Diff >>>>>>>> revision 2) >>>>>>>> >>>>>>>> None >>>>>>>> >>>>>>>> 55 >>>>>>>> >>>>>>>> boolean commutativeAndAssociative = aggregator.isCommutative() && >>>>>>>> aggregator.isAssociative(); >>>>>>>> >>>>>>>> This should be a method in AbstractFunction interface >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037> >>>>>>>> (Diff >>>>>>>> revision 2) >>>>>>>> >>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>>> 'expand_offset': 2} >>>>>>>> >>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>>> 'expand_offset': 2} >>>>>>>> >>>>>>>> 942 >>>>>>>> >>>>>>>> // Make all the non-zero values 0. >>>>>>>> >>>>>>>> vector.clear() >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047> >>>>>>>> (Diff >>>>>>>> revision 2) >>>>>>>> >>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>>> 'expand_offset': 2} >>>>>>>> >>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>>> 'expand_offset': 2} >>>>>>>> >>>>>>>> 952 >>>>>>>> >>>>>>>> while (it.hasNext()) { >>>>>>>> >>>>>>>> There is no guarantee of ordered iteration. The updates is unsorted. >>>>>>>> merge typically implies that input is sorted >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074> >>>>>>>> (Diff >>>>>>>> revision 2) >>>>>>>> >>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>>> 'expand_offset': 2} >>>>>>>> >>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>>> 'expand_offset': 2} >>>>>>>> >>>>>>>> 979 >>>>>>>> >>>>>>>> element.set(values[index]); >>>>>>>> >>>>>>>> What if there are values that are not in the existing data. I dont >>>>>>>> this is correct >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100> >>>>>>>> (Diff >>>>>>>> revision 2) >>>>>>>> >>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, >>>>>>>> 'expand_offset': 2} >>>>>>>> >>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, >>>>>>>> 'expand_offset': 2} >>>>>>>> >>>>>>>> 999 >>>>>>>> >>>>>>>> invalidateCachedLength(); >>>>>>>> >>>>>>>> move invalidateCachedLength to the set method. Instead of dispersing >>>>>>>> it everywhere. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115> >>>>>>>> (Diff >>>>>>>> revision 2) >>>>>>>> >>>>>>>> None >>>>>>>> >>>>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element> >>>>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070} >>>>>>>> >>>>>>>> 1014 >>>>>>>> >>>>>>>> protected Vector assignSkipZerosIterateOne(Iterator<Element> >>>>>>>> thisIterator, Vector other, >>>>>>>> >>>>>>>> be consistent, this that or thisIterator that Iterator. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java<https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171> >>>>>>>> (Diff >>>>>>>> revision 2) >>>>>>>> >>>>>>>> None >>>>>>>> >>>>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element> >>>>>>>> thisIterator, Iterator<Element> thatIterator,', 'line': 1070} >>>>>>>> >>>>>>>> 1070 >>>>>>>> >>>>>>>> private Vector assignIterateBoth(Iterator<Element> thisIterator, >>>>>>>> Iterator<Element> thatIterator, >>>>>>>> >>>>>>>> This expects the data to to be sorted. indicate that in a comment. >>>>>>>> Similarly comment other places >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> math/src/main/java/org/apache/mahout/math/function/Functions.java<https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900> >>>>>>>> (Diff >>>>>>>> revision 2) >>>>>>>> >>>>>>>> None >>>>>>>> >>>>>>>> {'text': ' public boolean isLikeLeftMult() {', 'line': 892} >>>>>>>> >>>>>>>> 896 >>>>>>>> >>>>>>>> /** >>>>>>>> >>>>>>>> Move this comments to the base class >>>>>>>> >>>>>>>> >>>>>>>> - Robin >>>>>>>> >>>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote: >>>>>>>> Review request for mahout, Ted Dunning, Sebastian Schelter, and >>>>>>>> Robin Anil. >>>>>>>> By Dan Filimon. >>>>>>>> >>>>>>>> *Updated April 22, 2013, 8:45 p.m.* >>>>>>>> Description >>>>>>>> >>>>>>>> This patch contains code cleaning up AbstractVector and making the >>>>>>>> operations as fast as possible while still having a high level >>>>>>>> interface. >>>>>>>> >>>>>>>> The main changes are in AbstractVector as well as new methods in >>>>>>>> DoubleDoubleFunction. >>>>>>>> >>>>>>>> Testing >>>>>>>> >>>>>>>> The vectors test pass but it's likely that the patch in it's current >>>>>>>> state is broken as other, unrelated tests (BallKMeans...) are failing. >>>>>>>> Also, my Hadoop conf is broken so I didn't run all the core tests. >>>>>>>> Anyone? >>>>>>>> >>>>>>>> I can't seem to find the bug, so _please_ have a closer look. It's >>>>>>>> still a work in progress. >>>>>>>> >>>>>>>> The benchmarks seem comparable (although there are some jarring >>>>>>>> diferences – Minkowski distance seems a lot slower in new-dan-1 than >>>>>>>> old-trunk-2). It may be however that this is just variance due to the >>>>>>>> load of the machine at the time. I'm having trouble interpreting the >>>>>>>> benchmarks in general, so anyone who could give me a hand is more than >>>>>>>> welcome. >>>>>>>> >>>>>>>> Diffs >>>>>>>> >>>>>>>> - math/src/main/java/org/apache/mahout/math/AbstractMatrix.java >>>>>>>> (e12aa38) >>>>>>>> - math/src/main/java/org/apache/mahout/math/AbstractVector.java >>>>>>>> (090aa7a) >>>>>>>> - math/src/main/java/org/apache/mahout/math/Centroid.java >>>>>>>> (0c42196) >>>>>>>> - math/src/main/java/org/apache/mahout/math/ConstantVector.java >>>>>>>> (51d67d4) >>>>>>>> - math/src/main/java/org/apache/mahout/math/DelegatingVector.java >>>>>>>> (12220d4) >>>>>>>> - math/src/main/java/org/apache/mahout/math/DenseVector.java >>>>>>>> (41c356b) >>>>>>>> - >>>>>>>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java >>>>>>>> (094003b) >>>>>>>> - math/src/main/java/org/apache/mahout/math/MatrixSlice.java >>>>>>>> (7f79c96) >>>>>>>> - math/src/main/java/org/apache/mahout/math/MatrixVectorView.java >>>>>>>> (af70727) >>>>>>>> - math/src/main/java/org/apache/mahout/math/NamedVector.java >>>>>>>> (4b7a41d) >>>>>>>> - >>>>>>>> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java >>>>>>>> (650d82d) >>>>>>>> - math/src/main/java/org/apache/mahout/math/PermutedVectorView.java >>>>>>>> (d1ea93a) >>>>>>>> - >>>>>>>> math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java >>>>>>>> (6f85692) >>>>>>>> - >>>>>>>> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java >>>>>>>> (21982f9) >>>>>>>> - math/src/main/java/org/apache/mahout/math/Vector.java >>>>>>>> (2f8b417) >>>>>>>> - math/src/main/java/org/apache/mahout/math/VectorView.java >>>>>>>> (add2a60) >>>>>>>> - math/src/main/java/org/apache/mahout/math/WeightedVector.java >>>>>>>> (06fbd05) >>>>>>>> - >>>>>>>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java >>>>>>>> (82b350a) >>>>>>>> - math/src/main/java/org/apache/mahout/math/function/Functions.java >>>>>>>> (eb2e42f) >>>>>>>> - math/src/main/java/org/apache/mahout/math/function/PlusMult.java >>>>>>>> (60587b1) >>>>>>>> - >>>>>>>> math/src/main/java/org/apache/mahout/math/function/TimesFunction.java >>>>>>>> (8ab0bb1) >>>>>>>> - math/src/main/java/org/apache/mahout/math/jet/math/Constants.java >>>>>>>> (53535d6) >>>>>>>> - math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java >>>>>>>> (2b11199) >>>>>>>> - math/src/test/java/org/apache/mahout/math/VectorTest.java >>>>>>>> (d6d554b) >>>>>>>> >>>>>>>> View Diff <https://reviews.apache.org/r/10669/diff/> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >>
