After this change the benchmark actually takes about 41 minutes :) Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
On Mon, Apr 29, 2013 at 9:45 PM, Robin Anil <[email protected]> wrote: > After increasing the lead time to 15 seconds, I believe I am giving enough > time for JIT to take place. This way we are measuring only the JITed code > not the interpreted code. There is a flag in JVM with which you can change > the threshold -XX:+CompileThreshold (default: 10000). I didnt want to mess > with it to make it as close as possible to the real world. > > I have also disabled the cluster benchmarks as they are not executed > enough times for JIT to compile them. I have to look more at them. for now > ignore them for your patch. > > But you can see that variance is greatly reduced (See Sheet Version8) > > https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=8 > > There are only a handful of regressions Norm1 (for Seq and Rand), Minus > and Plus (for Rand) and a few others. > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > > On Mon, Apr 29, 2013 at 11:37 AM, Robin Anil <[email protected]> wrote: > >> I have a few ideas to tweak around with the JIT to make the benchmarks >> more stable. I will ping back once I get some time to do something. >> >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >> >> >> On Mon, Apr 29, 2013 at 10:32 AM, Dan Filimon < >> [email protected]> wrote: >> >>> That's the weirdness, it's _exactly_ the same code. >>> >>> >>> On Mon, Apr 29, 2013 at 6:31 PM, Robin Anil <[email protected]> >>> wrote: >>> >>> > I will pull your Patch in and take a look a close look at it tonight. >>> Whats >>> > different in Run3 in Version7 v/s Run2 and Run1. Because Run3 looks >>> really >>> > good and the regressions of 30% in some are actually not as bad and >>> nothing >>> > no longer looks ridiculously blocking the patch. >>> > >>> > >>> > On Mon, Apr 29, 2013 at 10:13 AM, Dan Filimon >>> > <[email protected]>wrote: >>> > >>> > > I uploaded the latest patch to my vector branch on github [0]. >>> > > >>> > > The latest revision is, as always, here [1], "Version 7 >>> > (Assign/Aggregate; >>> > > 1 sec runtime)". Tests pass for this one except for some aggregate >>> tests >>> > > (it's not clear if aggregate should throw an exception when the >>> vectors >>> > are >>> > > empty or return 0; I chose return 0). >>> > > >>> > > There's quite a bit of variance, a lot of it even in code that I >>> haven't >>> > > changed between runs. So, norm2 shows regressions in version 7 and >>> > > improvements in version 6 and the code not having been changed. >>> > > >>> > > The dot product is also oscillating in version 7, showing different >>> > results >>> > > in the different runs (of the same algorithm). >>> > > >>> > > Except for some cleanup, I don't think I can do much better than >>> this, so >>> > > could someone please also run the benchmarks and I'll update the >>> patch on >>> > > ReviewBoard? >>> > > >>> > > [0] https://github.com/dfilimon/mahout/tree/vector >>> > > [1] >>> > > >>> > > >>> > >>> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=7 >>> > > >>> > > >>> > > On Sun, Apr 28, 2013 at 11:18 PM, Ted Dunning <[email protected] >>> > >>> > > wrote: >>> > > >>> > > > It would *really* help to put a legend on these spreadsheets even >>> if >>> > you >>> > > > have described it in the past. >>> > > > >>> > > > I take it that these are number of ops and bigger is better, right? >>> > > > >>> > > > And where is a description of what patch runs 1 and 2 actually are? >>> > > > >>> > > > >>> > > > >>> > > > On Sun, Apr 28, 2013 at 2:23 AM, Dan Filimon < >>> > > [email protected]>wrote: >>> > > > >>> > > >> Here's [1] the most recent benchmarks, after refactoring the >>> Assign >>> > into >>> > > >> separate classes. >>> > > >> It's now using a 10 second runtime and a 1 second lead time and >>> these >>> > > are >>> > > >> only the benchmarks involving assign(). >>> > > >> >>> > > >> Oddly enough there are regressions with RandomAccessSparseVectors' >>> > > plus() >>> > > >> and minus() despite these working like trunk from what I can tell >>> from >>> > > the >>> > > >> tests (VectorBinaryAssignTest). >>> > > >> >>> > > >> Also, there's quite a bit of variance for some reason... Clone, >>> Create >>> > > >> are just two of the odd ones. >>> > > >> >>> > > >> [1] >>> > > >> >>> > > >>> > >>> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4 >>> > > >> >>> > > >> >>> > > >> On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil <[email protected] >>> > >>> > > wrote: >>> > > >> >>> > > >>> Do in place for rand an dense. Assign a bug to me to speed up >>> rasv >>> > > >>> GetElement. >>> > > >>> On Apr 25, 2013 2:56 PM, "Dan Filimon" < >>> [email protected]> >>> > > >>> wrote: >>> > > >>> >>> > > >>> > Nearly done splitting the code up, but I'm not sure what the >>> costs >>> > > >>> should >>> > > >>> > ideally be. >>> > > >>> > >>> > > >>> > Robin, you proposed: cost of iteration + cost of lookup + cost >>> of >>> > > >>> update >>> > > >>> > (if its not in-place) >>> > > >>> > >>> > > >>> > This sounds like it's per element, rather than for the entire >>> > vector. >>> > > >>> Also, >>> > > >>> > are we just going to assume that there will be as many updates >>> as >>> > > >>> non-zeros >>> > > >>> > or as large as the array is? How does the cost of an update >>> factor >>> > > in? >>> > > >>> > >>> > > >>> > And, is "in-place" just for DenseVectors? >>> > > >>> > >>> > > >>> > >>> > > >>> > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil < >>> [email protected]> >>> > > >>> wrote: >>> > > >>> > >>> > > >>> > > Depends on the speed of copying a 1M double array v/s doing a >>> > > calloc >>> > > >>> + >>> > > >>> > > copying 1000 non zeros (Assuming java is doing that >>> underneath). >>> > > >>> > > >>> > > >>> > > ------ >>> > > >>> > > Robin Anil >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon < >>> > > >>> > [email protected] >>> > > >>> > > >wrote: >>> > > >>> > > >>> > > >>> > > > Right, but is clone() generally slower than assigning? That >>> > > >>> strikes me >>> > > >>> > as >>> > > >>> > > > odd; doesn't Java optimize copying the internal structures >>> > (there >>> > > >>> are >>> > > >>> > > > arrays underneath after all)? >>> > > >>> > > > >>> > > >>> > > > >>> > > >>> > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil < >>> > > [email protected] >>> > > >>> > >>> > > >>> > > wrote: >>> > > >>> > > > >>> > > >>> > > >> Seems like for dense clone is slower than like().assign I >>> need >>> > > to >>> > > >>> test >>> > > >>> > > it >>> > > >>> > > >> with different sizes to be sure. I kept it from the old >>> > > behavior. >>> > > >>> > > >> On Apr 25, 2013 2:12 AM, "Dan Filimon" < >>> > > >>> [email protected]> >>> > > >>> > > >> wrote: >>> > > >>> > > >> >>> > > >>> > > >>> Okay, so I should split it further into smaller sub-cases >>> > that >>> > > >>> handle >>> > > >>> > > >>> each Vector type. I tried making so that these match the >>> > cases >>> > > >>> in the >>> > > >>> > > >>> document to the extent possible. >>> > > >>> > > >>> You're right. It is ugly and I need to split it up. >>> > > >>> > > >>> >>> > > >>> > > >>> I removed the OrderedIntDoubleMapping (but with another >>> > if...) >>> > > >>> but >>> > > >>> > one >>> > > >>> > > >>> more thing that helped is making the assign() only go >>> through >>> > > the >>> > > >>> > > non-zeros >>> > > >>> > > >>> when setting up a new copy. >>> > > >>> > > >>> >>> > > >>> > > >>> About the copying, I saw that the optimized copy doesn't >>> > always >>> > > >>> call >>> > > >>> > > >>> clone clone on the Vector. Why is this? >>> > > >>> > > >>> >>> > > >>> > > >>> Thanks for the test and the feedback! :) >>> > > >>> > > >>> >>> > > >>> > > >>> >>> > > >>> > > >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil < >>> > > >>> [email protected] >>> > > >>> > > >wrote: >>> > > >>> > > >>> >>> > > >>> > > >>>> Yes its reaching dev. I checked out your code to take a >>> > look. >>> > > I >>> > > >>> > > >>>> understand you are trying to still mix >>> isSequentialAccess >>> > etc >>> > > >>> in the >>> > > >>> > > >>>> validity calculation. The whole point of this >>> architecture >>> > is >>> > > to >>> > > >>> > > completely >>> > > >>> > > >>>> unroll things to be more modular instead of creating a >>> giant >>> > > >>> > spaghetti >>> > > >>> > > >>>> >>> > > >>> > > >>>> Here is an example of what I am trying to say. >>> > > >>> > > >>>> >>> > > >>> > > >>>> This is your current implementation of >>> > > >>> AssignIterateOneLookupOther. >>> > > >>> > > >>>> There are so many conditionals >>> > > >>> > > >>>> >>> > > >>> > > >>>> >>> > > >>> > > >>>> 1. @Override >>> > > >>> > > >>>> 2. public Vector assign(Vector x, Vector y, >>> > > >>> > > >>>> DoubleDoubleFunction f) { >>> > > >>> > > >>>> 3. return !swap ? assignInner(x, y, f) : >>> > > >>> assignInner(y, x, >>> > > >>> > > f); >>> > > >>> > > >>>> 4. } >>> > > >>> > > >>>> 5. >>> > > >>> > > >>>> 6. public Vector assignInner(Vector x, Vector y, >>> > > >>> > > >>>> DoubleDoubleFunction f) { >>> > > >>> > > >>>> 7. Iterator<Vector.Element> xi = >>> > x.iterateNonZero(); >>> > > >>> > > >>>> 8. Vector.Element xe; >>> > > >>> > > >>>> 9. Vector.Element ye; >>> > > >>> > > >>>> 10. OrderedIntDoubleMapping updates = >>> > > >>> > > newOrderedIntDoubleMapping(); >>> > > >>> > > >>>> 11. while (xi.hasNext()) { >>> > > >>> > > >>>> 12. xe = xi.next(); >>> > > >>> > > >>>> 13. ye = y.getElement(xe.index()); >>> > > >>> > > >>>> 14. if (!swap) { >>> > > >>> > > >>>> 15. xe.set(f.apply(xe.get(), ye.get())); >>> > > >>> > > >>>> 16. } else { >>> > > >>> > > >>>> 17. if (ye.get() != 0.0 || >>> > > y.isAddConstantTime()) { >>> > > >>> > > >>>> 18. ye.set(f.apply(ye.get(), xe.get())); >>> > > >>> > > >>>> 19. } else { >>> > > >>> > > >>>> 20. updates.set(xe.index(), >>> f.apply(ye.get(), >>> > > >>> > > >>>> xe.get())); >>> > > >>> > > >>>> 21. } >>> > > >>> > > >>>> 22. } >>> > > >>> > > >>>> 23. } >>> > > >>> > > >>>> 24. if (swap && !y.isAddConstantTime()) { >>> > > >>> > > >>>> 25. y.mergeUpdates(updates); >>> > > >>> > > >>>> 26. } >>> > > >>> > > >>>> 27. return swap ? y : x; >>> > > >>> > > >>>> 28. } >>> > > >>> > > >>>> 29. } >>> > > >>> > > >>>> >>> > > >>> > > >>>> Split this into. >>> > > >>> > > >>>> >>> > > >>> > > >>>> IterateThisLookupThatInplaceUpdate >>> > > >>> > > >>>> IterateThatLookupThisInplaceUpdate >>> > > >>> > > >>>> IterateThisLookupThatMergeUpdate >>> > > >>> > > >>>> IterateThatLookupThisMergeUpdate >>> > > >>> > > >>>> >>> > > >>> > > >>>> Removing the conditionals itself will speedup the >>> operations >>> > > >>> you see >>> > > >>> > > >>>> regression on. >>> > > >>> > > >>>> >>> > > >>> > > >>>> Also note that getElement() is not constant Add time for >>> > RASV. >>> > > >>> In >>> > > >>> > > >>>> earlier version we tried at best to set in place. That >>> > reduced >>> > > >>> the >>> > > >>> > > extra >>> > > >>> > > >>>> hashkey computation. getElement in AbstractVector is >>> really >>> > > >>> > optimized >>> > > >>> > > for >>> > > >>> > > >>>> DenseVector and it does indicate as the cause of >>> regression >>> > > you >>> > > >>> are >>> > > >>> > > seeing >>> > > >>> > > >>>> in RASV. Another one is the unused new >>> > > >>> OrderedIntDoubleMapping(). >>> > > >>> > > Removing >>> > > >>> > > >>>> all these will bring back the loss. >>> > > >>> > > >>>> >>> > > >>> > > >>>> Now the cost for each can be computed as: cost of >>> iteration >>> > + >>> > > >>> cost >>> > > >>> > of >>> > > >>> > > >>>> lookup + cost of update (if its not in-place) >>> > > >>> > > >>>> >>> > > >>> > > >>>> See the attached file for a mock test. You will have to >>> > rework >>> > > >>> the >>> > > >>> > > >>>> expectations but the framework should be understandable. >>> > > >>> > > >>>> >>> > > >>> > > >>>> >>> > > >>> > > >>>> ------ >>> > > >>> > > >>>> Robin Anil >>> > > >>> > > >>>> >>> > > >>> > > >>>> >>> > > >>> > > >>>> >>> > > >>> > > >>>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon < >>> > > >>> > > >>>> [email protected]> wrote: >>> > > >>> > > >>>> >>> > > >>> > > >>>>> Here's [1] a link to the "design document" of the new >>> > vector >>> > > >>> > > >>>>> operations so I can lay out the ideas behind what I'm >>> doing >>> > > >>> more >>> > > >>> > > clearly. >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> I'd like anyone who can to have a look and comment. >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> Ted, >>> > > >>> > > >>>>> I know you'd like me to get back to working on >>> clustering, >>> > > but >>> > > >>> > > >>>>> currently, BallKMeans is roughly 2x as slow as Mahout's >>> > > KMeans. >>> > > >>> > > >>>>> This is because of centroid.update() for one (it >>> doesn't >>> > have >>> > > >>> > > >>>>> "interesting" properties that are exploited in the >>> current) >>> > > and >>> > > >>> > also >>> > > >>> > > >>>>> because we're using we're just doing lots of Vector >>> > > >>> operations: dot >>> > > >>> > > >>>>> products for projections, hash codes for unique sets of >>> > > >>> candidates, >>> > > >>> > > >>>>> equality checks etc. >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> Here's a snapshot of the hotspots for one run of >>> BallKMeans >>> > > (20 >>> > > >>> > > >>>>> newsgroups, unprojected, i.e 20K sparse vectors with >>> 200K >>> > > >>> > dimensions >>> > > >>> > > and >>> > > >>> > > >>>>> ~100 nonzero values): >>> > > >>> > > >>>>> Name Time (ms) >>> > > >>> org.apache.mahout.math.AbstractVector.dot(Vector) >>> > > >>> > > >>>>> 104069 >>> org.apache.mahout.math.DelegatingVector.hashCode() >>> > > >>> 30159 >>> > > >>> > > >>> > org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator, >>> > > >>> > > >>>>> Iterator, DoubleDoubleFunction) 13720 >>> > > >>> > > >>>>> >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping) >>> > > >>> > > >>>>> 2911 java.lang.ClassLoader.loadClass(String, boolean) >>> 2620 >>> > > >>> > > sun.misc.Launcher$AppClassLoader.loadClass(String, >>> > > >>> > > >>>>> boolean) 2620 >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> You can see that the most time is spent doing Vector >>> > > >>> operations. >>> > > >>> > > >>>>> Combined, the first three rows account for 91% of the >>> > > runtime. >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> Thanks for your help! >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> PS: Is this reaching dev@? >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> [1] >>> > > >>> > > >>>>> >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit# >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> >>> > > >>> > > >>>>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil < >>> > > >>> [email protected] >>> > > >>> > > >wrote: >>> > > >>> > > >>>>> >>> > > >>> > > >>>>>> Few more brain dump ideas >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>> For any operation (assign, aggregate) you are trying >>> to >>> > do. >>> > > >>> > > >>>>>> Compute set of "feasible" solutions: Based on all the >>> > > >>> properties >>> > > >>> > of >>> > > >>> > > >>>>>> functions >>> > > >>> > > >>>>>> Choose the solution that minimises the cost. >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>> For this you may need to refactor all the special >>> kinds of >>> > > >>> > > iterations >>> > > >>> > > >>>>>> and aggregations subclassing from AssignOperation and >>> > > >>> > > AggregateOperation. >>> > > >>> > > >>>>>> Each operation would return the cost given two input >>> > vectors >>> > > >>> > (based >>> > > >>> > > on >>> > > >>> > > >>>>>> their cost properties) >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>> ------ >>> > > >>> > > >>>>>> Robin Anil >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil < >>> > > >>> [email protected] >>> > > >>> > > >wrote: >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>>>> Strike that. I think the code is very confusing. You >>> will >>> > > >>> have >>> > > >>> > > >>>>>>> better luck if you create method for the three >>> > > >>> implementations. >>> > > >>> > It >>> > > >>> > > will >>> > > >>> > > >>>>>>> make things explicit and easy to debug. >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> public final double getIterationCost() { >>> > > >>> > > >>>>>>> return one of {cardinality, 1.2*numNonDefault, >>> > > >>> numNonDefault} >>> > > >>> > > >>>>>>> } >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> public final double getLookupCost() { >>> > > >>> > > >>>>>>> return one of {1, 1.5(lets say), >>> log_2(numNonDefault)} >>> > > >>> > > >>>>>>> } >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> All your logic then becomes a minimization of the >>> total >>> > > cost. >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> This is more extendable than current isDense(), >>> > > >>> isSequential(), >>> > > >>> > > >>>>>>> isSparse(), isRandom() and doesn't assume what the >>> > > underlying >>> > > >>> > > >>>>>>> implementation is. >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> ------ >>> > > >>> > > >>>>>>> Robin Anil >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil < >>> > > >>> > [email protected] >>> > > >>> > > >wrote: >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>>>> Also the calculation for cost is wrong. RASV is >>> HashMap >>> > > >>> based >>> > > >>> > its >>> > > >>> > > >>>>>>>> cost O(n) not O(log n) >>> > > >>> > > >>>>>>>> >>> > > >>> > > >>>>>>>> ------ >>> > > >>> > > >>>>>>>> Robin Anil >>> > > >>> > > >>>>>>>> >>> > > >>> > > >>>>>>>> >>> > > >>> > > >>>>>>>> >>> > > >>> > > >>>>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil < >>> > > >>> > [email protected] >>> > > >>> > > >wrote: >>> > > >>> > > >>>>>>>> >>> > > >>> > > >>>>>>>>> This is an automatically generated e-mail. To >>> reply, >>> > > >>> visit: >>> > > >>> > > >>>>>>>>> https://reviews.apache.org/r/10669/ >>> > > >>> > > >>>>>>>>> >>> > > >>> > math/src/main/java/org/apache/mahout/math/AbstractVector.java< >>> > > >>> > > >>> > > >>> >>> > > >>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56> >>> > > >>> > > (Diff >>> > > >>> > > >>>>>>>>> revision 2) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> None >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> 55 >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> boolean commutativeAndAssociative = >>> > > >>> > > aggregator.isCommutative() && aggregator.isAssociative(); >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> This should be a method in AbstractFunction >>> interface >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > math/src/main/java/org/apache/mahout/math/AbstractVector.java< >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037 >>> > > >>> > >>> > > >>> > > (Diff >>> > > >>> > > >>>>>>>>> revision 2) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', >>> > 'line': >>> > > >>> 726, >>> > > >>> > > 'expand_offset': 2} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', >>> > 'line': >>> > > >>> 927, >>> > > >>> > > 'expand_offset': 2} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> 942 >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> // Make all the non-zero values 0. >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> vector.clear() >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > math/src/main/java/org/apache/mahout/math/AbstractVector.java< >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047 >>> > > >>> > >>> > > >>> > > (Diff >>> > > >>> > > >>>>>>>>> revision 2) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', >>> > 'line': >>> > > >>> 726, >>> > > >>> > > 'expand_offset': 2} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', >>> > 'line': >>> > > >>> 927, >>> > > >>> > > 'expand_offset': 2} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> 952 >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> while (it.hasNext()) { >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> There is no guarantee of ordered iteration. The >>> > updates >>> > > >>> is >>> > > >>> > > unsorted. merge typically implies that input is sorted >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > math/src/main/java/org/apache/mahout/math/AbstractVector.java< >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074 >>> > > >>> > >>> > > >>> > > (Diff >>> > > >>> > > >>>>>>>>> revision 2) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', >>> > 'line': >>> > > >>> 726, >>> > > >>> > > 'expand_offset': 2} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', >>> > 'line': >>> > > >>> 927, >>> > > >>> > > 'expand_offset': 2} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> 979 >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> element.set(values[index]); >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> What if there are values that are not in the >>> existing >>> > > >>> data. I >>> > > >>> > > dont this is correct >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > math/src/main/java/org/apache/mahout/math/AbstractVector.java< >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100 >>> > > >>> > >>> > > >>> > > (Diff >>> > > >>> > > >>>>>>>>> revision 2) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', >>> > 'line': >>> > > >>> 726, >>> > > >>> > > 'expand_offset': 2} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', >>> > 'line': >>> > > >>> 927, >>> > > >>> > > 'expand_offset': 2} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> 999 >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> invalidateCachedLength(); >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> move invalidateCachedLength to the set method. >>> > Instead >>> > > of >>> > > >>> > > dispersing it everywhere. >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > math/src/main/java/org/apache/mahout/math/AbstractVector.java< >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115 >>> > > >>> > >>> > > >>> > > (Diff >>> > > >>> > > >>>>>>>>> revision 2) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> None >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': ' private Vector >>> > > >>> assignIterateBoth(Iterator<Element> >>> > > >>> > > thisIterator, Iterator<Element> thatIterator,', 'line': 1070} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> 1014 >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> protected Vector >>> > > >>> assignSkipZerosIterateOne(Iterator<Element> >>> > > >>> > > thisIterator, Vector other, >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> be consistent, this that or thisIterator that >>> > Iterator. >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > math/src/main/java/org/apache/mahout/math/AbstractVector.java< >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171 >>> > > >>> > >>> > > >>> > > (Diff >>> > > >>> > > >>>>>>>>> revision 2) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> None >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': ' private Vector >>> > > >>> assignIterateBoth(Iterator<Element> >>> > > >>> > > thisIterator, Iterator<Element> thatIterator,', 'line': 1070} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> 1070 >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> private Vector >>> assignIterateBoth(Iterator<Element> >>> > > >>> > > thisIterator, Iterator<Element> thatIterator, >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> This expects the data to to be sorted. indicate >>> that >>> > > in a >>> > > >>> > > comment. Similarly comment other places >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>> > math/src/main/java/org/apache/mahout/math/function/Functions.java< >>> > > >>> > > >>> > > >>> >>> > > >>> https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900 >>> > > >>> > > >>> > > >>> > > (Diff >>> > > >>> > > >>>>>>>>> revision 2) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> None >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> {'text': ' public boolean isLikeLeftMult() {', >>> > > 'line': >>> > > >>> > 892} >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> 896 >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> /** >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> Move this comments to the base class >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> - Robin >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote: >>> > > >>> > > >>>>>>>>> Review request for mahout, Ted Dunning, Sebastian >>> > > >>> Schelter, >>> > > >>> > and >>> > > >>> > > >>>>>>>>> Robin Anil. >>> > > >>> > > >>>>>>>>> By Dan Filimon. >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> *Updated April 22, 2013, 8:45 p.m.* >>> > > >>> > > >>>>>>>>> Description >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> This patch contains code cleaning up >>> AbstractVector and >>> > > >>> making >>> > > >>> > > the operations as fast as possible while still having a high >>> > level >>> > > >>> > > interface. >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> The main changes are in AbstractVector as well as >>> new >>> > > >>> methods >>> > > >>> > in >>> > > >>> > > DoubleDoubleFunction. >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> Testing >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> The vectors test pass but it's likely that the >>> patch in >>> > > >>> it's >>> > > >>> > > current state is broken as other, unrelated tests >>> (BallKMeans...) >>> > > are >>> > > >>> > > failing. >>> > > >>> > > >>>>>>>>> Also, my Hadoop conf is broken so I didn't run all >>> the >>> > > core >>> > > >>> > > tests. Anyone? >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> I can't seem to find the bug, so _please_ have a >>> closer >>> > > >>> look. >>> > > >>> > > It's still a work in progress. >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> The benchmarks seem comparable (although there are >>> some >>> > > >>> jarring >>> > > >>> > > diferences – Minkowski distance seems a lot slower in >>> new-dan-1 >>> > > than >>> > > >>> > > old-trunk-2). It may be however that this is just variance >>> due to >>> > > the >>> > > >>> > load >>> > > >>> > > of the machine at the time. I'm having trouble interpreting >>> the >>> > > >>> > benchmarks >>> > > >>> > > in general, so anyone who could give me a hand is more than >>> > > welcome. >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> Diffs >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > math/src/main/java/org/apache/mahout/math/AbstractMatrix.java >>> > > >>> > > >>>>>>>>> (e12aa38) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > math/src/main/java/org/apache/mahout/math/AbstractVector.java >>> > > >>> > > >>>>>>>>> (090aa7a) >>> > > >>> > > >>>>>>>>> - >>> > > >>> math/src/main/java/org/apache/mahout/math/Centroid.java >>> > > >>> > > >>>>>>>>> (0c42196) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > math/src/main/java/org/apache/mahout/math/ConstantVector.java >>> > > >>> > > >>>>>>>>> (51d67d4) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> math/src/main/java/org/apache/mahout/math/DelegatingVector.java >>> > > >>> > > >>>>>>>>> (12220d4) >>> > > >>> > > >>>>>>>>> - >>> > > >>> math/src/main/java/org/apache/mahout/math/DenseVector.java >>> > > >>> > > >>>>>>>>> (41c356b) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java >>> > > >>> > > >>>>>>>>> (094003b) >>> > > >>> > > >>>>>>>>> - >>> > > >>> math/src/main/java/org/apache/mahout/math/MatrixSlice.java >>> > > >>> > > >>>>>>>>> (7f79c96) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> math/src/main/java/org/apache/mahout/math/MatrixVectorView.java >>> > > >>> > > >>>>>>>>> (af70727) >>> > > >>> > > >>>>>>>>> - >>> > > >>> math/src/main/java/org/apache/mahout/math/NamedVector.java >>> > > >>> > > >>>>>>>>> (4b7a41d) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> > > >>> >>> > math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java >>> > > >>> > > >>>>>>>>> (650d82d) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> math/src/main/java/org/apache/mahout/math/PermutedVectorView.java >>> > > >>> > > >>>>>>>>> (d1ea93a) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> > > >>> >>> > math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java >>> > > >>> > > >>>>>>>>> (6f85692) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java >>> > > >>> > > >>>>>>>>> (21982f9) >>> > > >>> > > >>>>>>>>> - >>> > > math/src/main/java/org/apache/mahout/math/Vector.java >>> > > >>> > > >>>>>>>>> (2f8b417) >>> > > >>> > > >>>>>>>>> - >>> > > >>> math/src/main/java/org/apache/mahout/math/VectorView.java >>> > > >>> > > >>>>>>>>> (add2a60) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > math/src/main/java/org/apache/mahout/math/WeightedVector.java >>> > > >>> > > >>>>>>>>> (06fbd05) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java >>> > > >>> > > >>>>>>>>> (82b350a) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> math/src/main/java/org/apache/mahout/math/function/Functions.java >>> > > >>> > > >>>>>>>>> (eb2e42f) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> math/src/main/java/org/apache/mahout/math/function/PlusMult.java >>> > > >>> > > >>>>>>>>> (60587b1) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> > > math/src/main/java/org/apache/mahout/math/function/TimesFunction.java >>> > > >>> > > >>>>>>>>> (8ab0bb1) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> math/src/main/java/org/apache/mahout/math/jet/math/Constants.java >>> > > >>> > > >>>>>>>>> (53535d6) >>> > > >>> > > >>>>>>>>> - >>> > > >>> > > >>> math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java >>> > > >>> > > >>>>>>>>> (2b11199) >>> > > >>> > > >>>>>>>>> - >>> > > >>> math/src/test/java/org/apache/mahout/math/VectorTest.java >>> > > >>> > > >>>>>>>>> (d6d554b) >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>>> View Diff < >>> https://reviews.apache.org/r/10669/diff/> >>> > > >>> > > >>>>>>>>> >>> > > >>> > > >>>>>>>> >>> > > >>> > > >>>>>>>> >>> > > >>> > > >>>>>>> >>> > > >>> > > >>>>>> >>> > > >>> > > >>>>> >>> > > >>> > > >>>> >>> > > >>> > > >>> >>> > > >>> > > > >>> > > >>> > > >>> > > >>> > >>> > > >>> >>> > > >> >>> > > >> >>> > > > >>> > > >>> > >>> >> >> >
