Hi, Mark, I totally agree we shall optimize the EOP. The optimization I can think of now is: 1. In tez, we don't bound to one key per call mode, we can return EOP only when we exhaust all keys (you've already did that) 2. In MR, POJoinPackage currently return a EOP per flattened output, we can optimize to use a EOP per key
I still think STATUS_NULL just means a null value. I didn't find evidence it is used to mean something else. Ignoring STATUS_NULL in the runPipeline() should be right. That's because null tuple in a bag is valid, but when we flatten to the top level null is not valid. I did trace some join queries in MR code, I didn't find a case where we use STATUS_NULL to signal a situation other than a null value. Did you have a scenario where we use STATUS_NULL to signal something else? Thanks, Daniel On Tue, Nov 12, 2013 at 4:32 PM, Mark Wagner <[email protected]>wrote: > I'm definitely +1 on not having so many EOPs go through the pipeline > in the reducer. I've seen good improvements for maps in Tez, and I > suspect a join in classic MR could have similar gains. > > For STATUS_NULL: I don't agree that the semantics are clear right now. > In the PigGenericMapReduce$Reducer::processOnePackageOutput(context) > call that you mentioned, package output does have it's return status > checked and if it's STATUS_NULL, it is discarded. At one point I > suspected maybe STATUS_NULL really did just mean null and all the > relational operators were just ignoring null values, but many of the > expression operators have checks like (res.returnStatus == > POStatus.STATUS_OK && res.result != null). > > So it's certainly confusing to me and it doesn't seem that there's a > concrete definition of STATUS_NULL right now. I don't think there's > much utility in having a STATUS_NULL that indicates a null value when > I can just check res.result. That's why I'm in favor of #2. > > I have a patch that has passed all of 'test-commit' and is now going > through 'test'. I'll post this is in a JIRA + RB and we can have more > discussion there. > > Thanks, > Mark > > On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[email protected]> wrote: > > Also in continuation to our talk last week, it seems we are doing > something > > inefficient in MR. I believe that should be a separate issue other than > > STATUS_NULL, it is more than a semantic cleanup (actually semantic is > clear > > in existing code, just need some cleanup). > > > > Currently what Pig does is: > > 1. pull one record from POPackage/POJoinPackage > > 2. attach the record to pipeline > > 3. pull the leaf of pipeline until EOP > > 4. return to #1 > > > > Seems to have a EOP to go through the entire pipeline for each > > POPackage/POJoinPackage output is unnecessary. It can be optimized. It > > would be interesting to explore the opportunity of the optimization here. > > > > Thanks, > > Daniel > > > > > > On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[email protected]> > wrote: > > > >> We shall definitely clear the ambiguity. +1 for #1. > >> > >> For #2, I did some digging into POPackage. What I see is in POPackage we > >> produce STATUS_NULL, but not POJoinPackage. And the return value for > >> POPackage is not used: > >> if (pack instanceof POJoinPackage) { > >> while (true) > >> { > >> .... > >> if (processOnePackageOutput(context)) > >> break; > >> } > >> } > >> else { > >> .... > >> processOnePackageOutput(context); > >> } > >> > >> So it seems STATUS_NULL is not effectively in use in > >> POPackage/POJoinPackage. We shall always use STATUS_EOP to indicate the > end > >> of pull (for both POPackage and pipeline). We shall not use STATUS_NULL > >> anywhere once we make #1 clear. Am I right? > >> > >> Thanks, > >> Daniel > >> > >> > >> On Tue, Nov 12, 2013 at 8:35 AM, Cheolsoo Park <[email protected] > >wrote: > >> > >>> >> I'll post a patch that makes the second option official and changes > all > >>> the examples of the first to return null with STATUS_OK (this is what > >>> happens with the constant null currently). > >>> > >>> I am +1 for this. > >>> > >>> > >>> On Fri, Nov 8, 2013 at 10:16 PM, Mark Wagner <[email protected] > >>> >wrote: > >>> > >>> > Hi Devs, > >>> > > >>> > While discussing the Pig on Tez project today, we were unable to > reach > >>> > a conclusion on what the semantics of what POStatus.STATUS_NULL are > >>> > and would like to ask those who remember the history of Pig to chime > >>> > in. The two interpretations are: > >>> > > >>> > 1. POStatus.STATUS_NULL indicates that the pulled output IS null. > This > >>> > is used in some places like EqualToExpr. > >>> > > >>> > 2. POStatus.STATUS_NULL indicates that the pull did not produce any > >>> > output. This is backed up by it's usage in POPackage for flattening > an > >>> > empty bag, and PigGenericMapBase where pulls on the operator pipeline > >>> > that result in STATUS_NULL are discarded. > >>> > > >>> > If nobody has any concrete documentation or explanation for the > >>> > discrepancy, I'll post a patch that makes the second option official > >>> > and changes all the examples of the first to return null with > >>> > STATUS_OK (this is what happens with the constant null currently). > >>> > > >>> > Thanks, > >>> > Mark > >>> > > >>> > >> > >> > > > > -- > > CONFIDENTIALITY NOTICE > > NOTICE: This message is intended for the use of the individual or entity > to > > which it is addressed and may contain information that is confidential, > > privileged and exempt from disclosure under applicable law. If the reader > > of this message is not the intended recipient, you are hereby notified > that > > any printing, copying, dissemination, distribution, disclosure or > > forwarding of this communication is strictly prohibited. If you have > > received this communication in error, please contact the sender > immediately > > and delete it from your system. Thank You. > -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
