After some more digging, I now agree with you on the fix for POForEach. The
fix I did in PIG-3060 is wrong and could result missing records in the
testcase below:

A = load 'student'....;
B = load 'voter'....;
C = cogroup student by name, voter by name;
D = foreach C generate flatten(A), flatten(B);
E = foreach D generate flatten(BagGenrator1(name)),
flatten(BagGenerator2(contribution)); -- suppose BagGenrator1 generates
{(name)}, BagGenerator2 generates {(contribution)} if contribution>50,
othewise {}

Suppose reduce input is:
C: bob, {(bob, 19, 7.1)}, {(bob, republican, 30),(bob, democrat, 80)}

Now D produce the first tuple:
D: bob, 19, 7.1, bob, republican, 30

E before flatten {(bob}}, {}

Flatten E produce a EOP which skip the second output of D.

We need to roll it back.

Then seems your understanding about the STATUS_NULL is right. This could be
a convenient way to skip some records. Otherwise we can choose not to emit
STATUS_NULL and keep pulling new input, but that could make operator logic
complex.

I haven't go through the entire picture. Seems In the new model, we have
two concept "end of key" and "end of all inputs". In MR, we only need to
care about "end of key", the mapreduce framework will take care of "end of
all inputs". Not sure if any other change needed to make it work. Need to
explore a bit more.

Thanks,
Daniel


On Tue, Nov 12, 2013 at 6:23 PM, Mark Wagner <[email protected]>wrote:

> The main case that motivated my 'no result' interpretation was line
> 299 in POPackage where flattening empty bags results in STATUS_NULL
> and that many of the expression operators do not check STATUS_NULL,
> but check result != null instead.
>
> I think we can break this down into two parts, each of which are
> probably more easily agreed upon:
>
> 1. Is it necessary to have a POStatus just to signify null? Why can't
> we just set result = null with STATUS_OK? (This should only arise in
> expression plans) (I'd lean towards using null and STATUS_OK)
> 2. Does it make sense to have a POStatus which signifies an empty
> result which should be ignored? Or should we just use the
> representation of null for this. This should only arise in
> top-level/relational operators, I believe. I don't feel strongly about
> this, as long as we're consistent.
>
> -Mark
>
> On Tue, Nov 12, 2013 at 5:30 PM, Daniel Dai <[email protected]> wrote:
> > Hi, Mark,
> > I totally agree we shall optimize the EOP. The optimization I can think
> of
> > now is:
> > 1. In tez, we don't bound to one key per call mode, we can return EOP
> only
> > when we exhaust all keys (you've already did that)
> > 2. In MR, POJoinPackage currently return a EOP per flattened output, we
> can
> > optimize to use a EOP per key
> >
> > I still think STATUS_NULL just means a null value. I didn't find evidence
> > it is used to mean something else. Ignoring STATUS_NULL in the
> > runPipeline() should be right. That's because null tuple in a bag is
> valid,
> > but when we flatten to the top level null is not valid. I did trace some
> > join queries in MR code, I didn't find a case where we use STATUS_NULL to
> > signal a situation other than a null value. Did you have a scenario where
> > we use STATUS_NULL to signal something else?
> >
> > Thanks,
> > Daniel
> >
> >
> > On Tue, Nov 12, 2013 at 4:32 PM, Mark Wagner <[email protected]
> >wrote:
> >
> >> I'm definitely +1 on not having so many EOPs go through the pipeline
> >> in the reducer. I've seen good improvements for maps in Tez, and I
> >> suspect a join in classic MR could have similar gains.
> >>
> >> For STATUS_NULL: I don't agree that the semantics are clear right now.
> >> In the PigGenericMapReduce$Reducer::processOnePackageOutput(context)
> >> call that you mentioned, package output does have it's return status
> >> checked and if it's STATUS_NULL, it is discarded. At one point I
> >> suspected maybe STATUS_NULL really did just mean null and all the
> >> relational operators were just ignoring null values, but many of the
> >> expression operators have checks like (res.returnStatus ==
> >> POStatus.STATUS_OK && res.result != null).
> >>
> >> So it's certainly confusing to me and it doesn't seem that there's a
> >> concrete definition of STATUS_NULL right now. I don't think there's
> >> much utility in having a STATUS_NULL that indicates a null value when
> >> I can just check res.result. That's why I'm in favor of #2.
> >>
> >> I have a patch that has passed all of 'test-commit' and is now going
> >> through 'test'. I'll post this is in a JIRA + RB and we can have more
> >> discussion there.
> >>
> >> Thanks,
> >> Mark
> >>
> >> On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[email protected]>
> wrote:
> >> > Also in continuation to our talk last week, it seems we are doing
> >> something
> >> > inefficient in MR. I believe that should be a separate issue other
> than
> >> > STATUS_NULL, it is more than a semantic cleanup (actually semantic is
> >> clear
> >> > in existing code, just need some cleanup).
> >> >
> >> > Currently what Pig does is:
> >> > 1. pull one record from POPackage/POJoinPackage
> >> > 2. attach the record to pipeline
> >> > 3. pull the leaf of pipeline until EOP
> >> > 4. return to #1
> >> >
> >> > Seems to have a EOP to go through the entire pipeline for each
> >> > POPackage/POJoinPackage output is unnecessary. It can be optimized. It
> >> > would be interesting to explore the opportunity of the optimization
> here.
> >> >
> >> > Thanks,
> >> > Daniel
> >> >
> >> >
> >> > On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[email protected]>
> >> wrote:
> >> >
> >> >> We shall definitely clear the ambiguity. +1 for #1.
> >> >>
> >> >> For #2, I did some digging into POPackage. What I see is in
> POPackage we
> >> >> produce STATUS_NULL, but not POJoinPackage. And the return value for
> >> >> POPackage is not used:
> >> >> if (pack instanceof POJoinPackage) {
> >> >>     while (true)
> >> >>     {
> >> >>         ....
> >> >>         if (processOnePackageOutput(context))
> >> >>             break;
> >> >>     }
> >> >> }
> >> >> else {
> >> >>      ....
> >> >>      processOnePackageOutput(context);
> >> >> }
> >> >>
> >> >> So it seems STATUS_NULL is not effectively in use in
> >> >> POPackage/POJoinPackage. We shall always use STATUS_EOP to indicate
> the
> >> end
> >> >> of pull (for both POPackage and pipeline). We shall not use
> STATUS_NULL
> >> >> anywhere once we make #1 clear. Am I right?
> >> >>
> >> >> Thanks,
> >> >> Daniel
> >> >>
> >> >>
> >> >> On Tue, Nov 12, 2013 at 8:35 AM, Cheolsoo Park <[email protected]
> >> >wrote:
> >> >>
> >> >>> >> I'll post a patch that makes the second option official and
> changes
> >> all
> >> >>> the examples of the first to return null with STATUS_OK (this is
> what
> >> >>> happens with the constant null currently).
> >> >>>
> >> >>> I am +1 for this.
> >> >>>
> >> >>>
> >> >>> On Fri, Nov 8, 2013 at 10:16 PM, Mark Wagner <
> [email protected]
> >> >>> >wrote:
> >> >>>
> >> >>> > Hi Devs,
> >> >>> >
> >> >>> > While discussing the Pig on Tez project today, we were unable to
> >> reach
> >> >>> > a conclusion on what the semantics of what POStatus.STATUS_NULL
> are
> >> >>> > and would like to ask those who remember the history of Pig to
> chime
> >> >>> > in. The two interpretations are:
> >> >>> >
> >> >>> > 1. POStatus.STATUS_NULL indicates that the pulled output IS null.
> >> This
> >> >>> > is used in some places like EqualToExpr.
> >> >>> >
> >> >>> > 2. POStatus.STATUS_NULL indicates that the pull did not produce
> any
> >> >>> > output. This is backed up by it's usage in POPackage for
> flattening
> >> an
> >> >>> > empty bag, and PigGenericMapBase where pulls on the operator
> pipeline
> >> >>> > that result in STATUS_NULL are discarded.
> >> >>> >
> >> >>> > If nobody has any concrete documentation or explanation for the
> >> >>> > discrepancy, I'll post a patch that makes the second option
> official
> >> >>> > and changes all the examples of the first to return null with
> >> >>> > STATUS_OK (this is what happens with the constant null currently).
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Mark
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >> > --
> >> > CONFIDENTIALITY NOTICE
> >> > NOTICE: This message is intended for the use of the individual or
> entity
> >> to
> >> > which it is addressed and may contain information that is
> confidential,
> >> > privileged and exempt from disclosure under applicable law. If the
> reader
> >> > of this message is not the intended recipient, you are hereby notified
> >> that
> >> > any printing, copying, dissemination, distribution, disclosure or
> >> > forwarding of this communication is strictly prohibited. If you have
> >> > received this communication in error, please contact the sender
> >> immediately
> >> > and delete it from your system. Thank You.
> >>
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Reply via email to