The main case that motivated my 'no result' interpretation was line 299 in POPackage where flattening empty bags results in STATUS_NULL and that many of the expression operators do not check STATUS_NULL, but check result != null instead.
I think we can break this down into two parts, each of which are probably more easily agreed upon: 1. Is it necessary to have a POStatus just to signify null? Why can't we just set result = null with STATUS_OK? (This should only arise in expression plans) (I'd lean towards using null and STATUS_OK) 2. Does it make sense to have a POStatus which signifies an empty result which should be ignored? Or should we just use the representation of null for this. This should only arise in top-level/relational operators, I believe. I don't feel strongly about this, as long as we're consistent. -Mark On Tue, Nov 12, 2013 at 5:30 PM, Daniel Dai <[email protected]> wrote: > Hi, Mark, > I totally agree we shall optimize the EOP. The optimization I can think of > now is: > 1. In tez, we don't bound to one key per call mode, we can return EOP only > when we exhaust all keys (you've already did that) > 2. In MR, POJoinPackage currently return a EOP per flattened output, we can > optimize to use a EOP per key > > I still think STATUS_NULL just means a null value. I didn't find evidence > it is used to mean something else. Ignoring STATUS_NULL in the > runPipeline() should be right. That's because null tuple in a bag is valid, > but when we flatten to the top level null is not valid. I did trace some > join queries in MR code, I didn't find a case where we use STATUS_NULL to > signal a situation other than a null value. Did you have a scenario where > we use STATUS_NULL to signal something else? > > Thanks, > Daniel > > > On Tue, Nov 12, 2013 at 4:32 PM, Mark Wagner <[email protected]>wrote: > >> I'm definitely +1 on not having so many EOPs go through the pipeline >> in the reducer. I've seen good improvements for maps in Tez, and I >> suspect a join in classic MR could have similar gains. >> >> For STATUS_NULL: I don't agree that the semantics are clear right now. >> In the PigGenericMapReduce$Reducer::processOnePackageOutput(context) >> call that you mentioned, package output does have it's return status >> checked and if it's STATUS_NULL, it is discarded. At one point I >> suspected maybe STATUS_NULL really did just mean null and all the >> relational operators were just ignoring null values, but many of the >> expression operators have checks like (res.returnStatus == >> POStatus.STATUS_OK && res.result != null). >> >> So it's certainly confusing to me and it doesn't seem that there's a >> concrete definition of STATUS_NULL right now. I don't think there's >> much utility in having a STATUS_NULL that indicates a null value when >> I can just check res.result. That's why I'm in favor of #2. >> >> I have a patch that has passed all of 'test-commit' and is now going >> through 'test'. I'll post this is in a JIRA + RB and we can have more >> discussion there. >> >> Thanks, >> Mark >> >> On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[email protected]> wrote: >> > Also in continuation to our talk last week, it seems we are doing >> something >> > inefficient in MR. I believe that should be a separate issue other than >> > STATUS_NULL, it is more than a semantic cleanup (actually semantic is >> clear >> > in existing code, just need some cleanup). >> > >> > Currently what Pig does is: >> > 1. pull one record from POPackage/POJoinPackage >> > 2. attach the record to pipeline >> > 3. pull the leaf of pipeline until EOP >> > 4. return to #1 >> > >> > Seems to have a EOP to go through the entire pipeline for each >> > POPackage/POJoinPackage output is unnecessary. It can be optimized. It >> > would be interesting to explore the opportunity of the optimization here. >> > >> > Thanks, >> > Daniel >> > >> > >> > On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[email protected]> >> wrote: >> > >> >> We shall definitely clear the ambiguity. +1 for #1. >> >> >> >> For #2, I did some digging into POPackage. What I see is in POPackage we >> >> produce STATUS_NULL, but not POJoinPackage. And the return value for >> >> POPackage is not used: >> >> if (pack instanceof POJoinPackage) { >> >> while (true) >> >> { >> >> .... >> >> if (processOnePackageOutput(context)) >> >> break; >> >> } >> >> } >> >> else { >> >> .... >> >> processOnePackageOutput(context); >> >> } >> >> >> >> So it seems STATUS_NULL is not effectively in use in >> >> POPackage/POJoinPackage. We shall always use STATUS_EOP to indicate the >> end >> >> of pull (for both POPackage and pipeline). We shall not use STATUS_NULL >> >> anywhere once we make #1 clear. Am I right? >> >> >> >> Thanks, >> >> Daniel >> >> >> >> >> >> On Tue, Nov 12, 2013 at 8:35 AM, Cheolsoo Park <[email protected] >> >wrote: >> >> >> >>> >> I'll post a patch that makes the second option official and changes >> all >> >>> the examples of the first to return null with STATUS_OK (this is what >> >>> happens with the constant null currently). >> >>> >> >>> I am +1 for this. >> >>> >> >>> >> >>> On Fri, Nov 8, 2013 at 10:16 PM, Mark Wagner <[email protected] >> >>> >wrote: >> >>> >> >>> > Hi Devs, >> >>> > >> >>> > While discussing the Pig on Tez project today, we were unable to >> reach >> >>> > a conclusion on what the semantics of what POStatus.STATUS_NULL are >> >>> > and would like to ask those who remember the history of Pig to chime >> >>> > in. The two interpretations are: >> >>> > >> >>> > 1. POStatus.STATUS_NULL indicates that the pulled output IS null. >> This >> >>> > is used in some places like EqualToExpr. >> >>> > >> >>> > 2. POStatus.STATUS_NULL indicates that the pull did not produce any >> >>> > output. This is backed up by it's usage in POPackage for flattening >> an >> >>> > empty bag, and PigGenericMapBase where pulls on the operator pipeline >> >>> > that result in STATUS_NULL are discarded. >> >>> > >> >>> > If nobody has any concrete documentation or explanation for the >> >>> > discrepancy, I'll post a patch that makes the second option official >> >>> > and changes all the examples of the first to return null with >> >>> > STATUS_OK (this is what happens with the constant null currently). >> >>> > >> >>> > Thanks, >> >>> > Mark >> >>> > >> >>> >> >> >> >> >> > >> > -- >> > CONFIDENTIALITY NOTICE >> > NOTICE: This message is intended for the use of the individual or entity >> to >> > which it is addressed and may contain information that is confidential, >> > privileged and exempt from disclosure under applicable law. If the reader >> > of this message is not the intended recipient, you are hereby notified >> that >> > any printing, copying, dissemination, distribution, disclosure or >> > forwarding of this communication is strictly prohibited. If you have >> > received this communication in error, please contact the sender >> immediately >> > and delete it from your system. Thank You. >> > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any printing, copying, dissemination, distribution, disclosure or > forwarding of this communication is strictly prohibited. If you have > received this communication in error, please contact the sender immediately > and delete it from your system. Thank You.
