Thanks Ashutosh and Jon, it makes sense now. I wouldn't try optimizing that at the moment, no serious concerns due to it with our scripts. Good to know the reasons behind the implementation.
-Prashant On Sun, May 27, 2012 at 1:34 AM, Jonathan Coveney <jcove...@gmail.com>wrote: > It's also worth noting that the speed gain is minimal. We're talking 4 > minutes per billion rows (per mapper) at most, and the speed gain only > applies to size()-max(setIndex) i.e. if you have a tuple of size 30 and > then set up to 15, then yeah, you wasted time nulling out 15 fields. It's > pretty minimal, though if you want to tackle making the implementation more > efficient and benchmark the difference, I'm all ears :) > > 2012/5/27 Ashutosh Chauhan <hashut...@apache.org> > > > I also had this question when I was playing with DefaultTuple once. I > > removed that null-addition loop, but then it failed while setting an > object > > at an specific index. There is a code in Pig which sets value at a > specific > > index in the tuple, which will throw NPE if you dont first add null in > the > > list. This is one of those oddities of java. > > > > List<String> list = new ArrayList<String>(3); > > list.set(1,"mystring"); > > > > throws NPE, while following doesn't. > > > > List<String> list = new ArrayList<String>(3); > > for (int i =0; i < 3; i++) { > > list.add(null); > > } > > list.set(1,"mystring"); > > > > Since client code using DefaultTuple is permissible to set value at any > > index of tuple, we have no choice but to null-fill the list first. > > > > Hope it helps, > > Ashutosh > > > > On Sat, May 26, 2012 at 9:13 PM, Jonathan Coveney <jcove...@gmail.com > > >wrote: > > > > > -user > > > +dev > > > > > > Haha, I made this very same comment somewhere, and noticed the exact > same > > > thing (I think I mention it in my SchemaTuple benchmarking). > > > > > > The reason is so that tuple.size() will return the right value. Also, > the > > > expectation is that if you append, it goes to the end of all of the > > nulls, > > > not the first position. It's a little confusing, and yeah, it surprised > > me > > > too. > > > > > > You could definitely amortize the cost of creation over the sets that > the > > > user does by keeping an index, but when I first saw it I decided that > the > > > (slightly) increased memory footprint and the increase in code > complexity > > > wasn't worth a very minimal increase. > > > > > > 2012/5/26 Prashant Kommireddi <prash1...@gmail.com> > > > > > > > I rambled across this while reviewing one of Jon's patches. Here is > the > > > > code from DefaultTuple > > > > > > > > /** > > > > * Construct a tuple with a known number of fields. Package level > so > > > > that callers cannot directly invoke it. > > > > * <br>Resulting tuple is filled pre-filled with null elements. > Time > > > > complexity: O(N), after allocation > > > > * > > > > * @param size > > > > * Number of fields to allocate in the tuple. > > > > */ > > > > DefaultTuple(int size) { > > > > mFields = new ArrayList<Object>(size); > > > > for (int i = 0; i < size; i++) > > > > mFields.add(null); > > > > } > > > > > > > > > > > > Why are we walking through the list to add nulls? Wouldn't the > initial > > > > creation of ArrayList suffice? > > > > mFields = new ArrayList<Object>(size) should be enough. > > > > > > > > Thanks, > > > > Prashant > > > > > > > > > >