I'm definitely +1 on not having so many EOPs go through the pipeline in the reducer. I've seen good improvements for maps in Tez, and I suspect a join in classic MR could have similar gains.
For STATUS_NULL: I don't agree that the semantics are clear right now. In the PigGenericMapReduce$Reducer::processOnePackageOutput(context) call that you mentioned, package output does have it's return status checked and if it's STATUS_NULL, it is discarded. At one point I suspected maybe STATUS_NULL really did just mean null and all the relational operators were just ignoring null values, but many of the expression operators have checks like (res.returnStatus == POStatus.STATUS_OK && res.result != null). So it's certainly confusing to me and it doesn't seem that there's a concrete definition of STATUS_NULL right now. I don't think there's much utility in having a STATUS_NULL that indicates a null value when I can just check res.result. That's why I'm in favor of #2. I have a patch that has passed all of 'test-commit' and is now going through 'test'. I'll post this is in a JIRA + RB and we can have more discussion there. Thanks, Mark On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[email protected]> wrote: > Also in continuation to our talk last week, it seems we are doing something > inefficient in MR. I believe that should be a separate issue other than > STATUS_NULL, it is more than a semantic cleanup (actually semantic is clear > in existing code, just need some cleanup). > > Currently what Pig does is: > 1. pull one record from POPackage/POJoinPackage > 2. attach the record to pipeline > 3. pull the leaf of pipeline until EOP > 4. return to #1 > > Seems to have a EOP to go through the entire pipeline for each > POPackage/POJoinPackage output is unnecessary. It can be optimized. It > would be interesting to explore the opportunity of the optimization here. > > Thanks, > Daniel > > > On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[email protected]> wrote: > >> We shall definitely clear the ambiguity. +1 for #1. >> >> For #2, I did some digging into POPackage. What I see is in POPackage we >> produce STATUS_NULL, but not POJoinPackage. And the return value for >> POPackage is not used: >> if (pack instanceof POJoinPackage) { >> while (true) >> { >> .... >> if (processOnePackageOutput(context)) >> break; >> } >> } >> else { >> .... >> processOnePackageOutput(context); >> } >> >> So it seems STATUS_NULL is not effectively in use in >> POPackage/POJoinPackage. We shall always use STATUS_EOP to indicate the end >> of pull (for both POPackage and pipeline). We shall not use STATUS_NULL >> anywhere once we make #1 clear. Am I right? >> >> Thanks, >> Daniel >> >> >> On Tue, Nov 12, 2013 at 8:35 AM, Cheolsoo Park <[email protected]>wrote: >> >>> >> I'll post a patch that makes the second option official and changes all >>> the examples of the first to return null with STATUS_OK (this is what >>> happens with the constant null currently). >>> >>> I am +1 for this. >>> >>> >>> On Fri, Nov 8, 2013 at 10:16 PM, Mark Wagner <[email protected] >>> >wrote: >>> >>> > Hi Devs, >>> > >>> > While discussing the Pig on Tez project today, we were unable to reach >>> > a conclusion on what the semantics of what POStatus.STATUS_NULL are >>> > and would like to ask those who remember the history of Pig to chime >>> > in. The two interpretations are: >>> > >>> > 1. POStatus.STATUS_NULL indicates that the pulled output IS null. This >>> > is used in some places like EqualToExpr. >>> > >>> > 2. POStatus.STATUS_NULL indicates that the pull did not produce any >>> > output. This is backed up by it's usage in POPackage for flattening an >>> > empty bag, and PigGenericMapBase where pulls on the operator pipeline >>> > that result in STATUS_NULL are discarded. >>> > >>> > If nobody has any concrete documentation or explanation for the >>> > discrepancy, I'll post a patch that makes the second option official >>> > and changes all the examples of the first to return null with >>> > STATUS_OK (this is what happens with the constant null currently). >>> > >>> > Thanks, >>> > Mark >>> > >>> >> >> > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any printing, copying, dissemination, distribution, disclosure or > forwarding of this communication is strictly prohibited. If you have > received this communication in error, please contact the sender immediately > and delete it from your system. Thank You.
