The main case that motivated my 'no result' interpretation was line
299 in POPackage where flattening empty bags results in STATUS_NULL
and that many of the expression operators do not check STATUS_NULL,
but check result != null instead.

I think we can break this down into two parts, each of which are
probably more easily agreed upon:

1. Is it necessary to have a POStatus just to signify null? Why can't
we just set result = null with STATUS_OK? (This should only arise in
expression plans) (I'd lean towards using null and STATUS_OK)
2. Does it make sense to have a POStatus which signifies an empty
result which should be ignored? Or should we just use the
representation of null for this. This should only arise in
top-level/relational operators, I believe. I don't feel strongly about
this, as long as we're consistent.

-Mark

On Tue, Nov 12, 2013 at 5:30 PM, Daniel Dai <[email protected]> wrote:
> Hi, Mark,
> I totally agree we shall optimize the EOP. The optimization I can think of
> now is:
> 1. In tez, we don't bound to one key per call mode, we can return EOP only
> when we exhaust all keys (you've already did that)
> 2. In MR, POJoinPackage currently return a EOP per flattened output, we can
> optimize to use a EOP per key
>
> I still think STATUS_NULL just means a null value. I didn't find evidence
> it is used to mean something else. Ignoring STATUS_NULL in the
> runPipeline() should be right. That's because null tuple in a bag is valid,
> but when we flatten to the top level null is not valid. I did trace some
> join queries in MR code, I didn't find a case where we use STATUS_NULL to
> signal a situation other than a null value. Did you have a scenario where
> we use STATUS_NULL to signal something else?
>
> Thanks,
> Daniel
>
>
> On Tue, Nov 12, 2013 at 4:32 PM, Mark Wagner <[email protected]>wrote:
>
>> I'm definitely +1 on not having so many EOPs go through the pipeline
>> in the reducer. I've seen good improvements for maps in Tez, and I
>> suspect a join in classic MR could have similar gains.
>>
>> For STATUS_NULL: I don't agree that the semantics are clear right now.
>> In the PigGenericMapReduce$Reducer::processOnePackageOutput(context)
>> call that you mentioned, package output does have it's return status
>> checked and if it's STATUS_NULL, it is discarded. At one point I
>> suspected maybe STATUS_NULL really did just mean null and all the
>> relational operators were just ignoring null values, but many of the
>> expression operators have checks like (res.returnStatus ==
>> POStatus.STATUS_OK && res.result != null).
>>
>> So it's certainly confusing to me and it doesn't seem that there's a
>> concrete definition of STATUS_NULL right now. I don't think there's
>> much utility in having a STATUS_NULL that indicates a null value when
>> I can just check res.result. That's why I'm in favor of #2.
>>
>> I have a patch that has passed all of 'test-commit' and is now going
>> through 'test'. I'll post this is in a JIRA + RB and we can have more
>> discussion there.
>>
>> Thanks,
>> Mark
>>
>> On Tue, Nov 12, 2013 at 3:19 PM, Daniel Dai <[email protected]> wrote:
>> > Also in continuation to our talk last week, it seems we are doing
>> something
>> > inefficient in MR. I believe that should be a separate issue other than
>> > STATUS_NULL, it is more than a semantic cleanup (actually semantic is
>> clear
>> > in existing code, just need some cleanup).
>> >
>> > Currently what Pig does is:
>> > 1. pull one record from POPackage/POJoinPackage
>> > 2. attach the record to pipeline
>> > 3. pull the leaf of pipeline until EOP
>> > 4. return to #1
>> >
>> > Seems to have a EOP to go through the entire pipeline for each
>> > POPackage/POJoinPackage output is unnecessary. It can be optimized. It
>> > would be interesting to explore the opportunity of the optimization here.
>> >
>> > Thanks,
>> > Daniel
>> >
>> >
>> > On Tue, Nov 12, 2013 at 2:38 PM, Daniel Dai <[email protected]>
>> wrote:
>> >
>> >> We shall definitely clear the ambiguity. +1 for #1.
>> >>
>> >> For #2, I did some digging into POPackage. What I see is in POPackage we
>> >> produce STATUS_NULL, but not POJoinPackage. And the return value for
>> >> POPackage is not used:
>> >> if (pack instanceof POJoinPackage) {
>> >>     while (true)
>> >>     {
>> >>         ....
>> >>         if (processOnePackageOutput(context))
>> >>             break;
>> >>     }
>> >> }
>> >> else {
>> >>      ....
>> >>      processOnePackageOutput(context);
>> >> }
>> >>
>> >> So it seems STATUS_NULL is not effectively in use in
>> >> POPackage/POJoinPackage. We shall always use STATUS_EOP to indicate the
>> end
>> >> of pull (for both POPackage and pipeline). We shall not use STATUS_NULL
>> >> anywhere once we make #1 clear. Am I right?
>> >>
>> >> Thanks,
>> >> Daniel
>> >>
>> >>
>> >> On Tue, Nov 12, 2013 at 8:35 AM, Cheolsoo Park <[email protected]
>> >wrote:
>> >>
>> >>> >> I'll post a patch that makes the second option official and changes
>> all
>> >>> the examples of the first to return null with STATUS_OK (this is what
>> >>> happens with the constant null currently).
>> >>>
>> >>> I am +1 for this.
>> >>>
>> >>>
>> >>> On Fri, Nov 8, 2013 at 10:16 PM, Mark Wagner <[email protected]
>> >>> >wrote:
>> >>>
>> >>> > Hi Devs,
>> >>> >
>> >>> > While discussing the Pig on Tez project today, we were unable to
>> reach
>> >>> > a conclusion on what the semantics of what POStatus.STATUS_NULL are
>> >>> > and would like to ask those who remember the history of Pig to chime
>> >>> > in. The two interpretations are:
>> >>> >
>> >>> > 1. POStatus.STATUS_NULL indicates that the pulled output IS null.
>> This
>> >>> > is used in some places like EqualToExpr.
>> >>> >
>> >>> > 2. POStatus.STATUS_NULL indicates that the pull did not produce any
>> >>> > output. This is backed up by it's usage in POPackage for flattening
>> an
>> >>> > empty bag, and PigGenericMapBase where pulls on the operator pipeline
>> >>> > that result in STATUS_NULL are discarded.
>> >>> >
>> >>> > If nobody has any concrete documentation or explanation for the
>> >>> > discrepancy, I'll post a patch that makes the second option official
>> >>> > and changes all the examples of the first to return null with
>> >>> > STATUS_OK (this is what happens with the constant null currently).
>> >>> >
>> >>> > Thanks,
>> >>> > Mark
>> >>> >
>> >>>
>> >>
>> >>
>> >
>> > --
>> > CONFIDENTIALITY NOTICE
>> > NOTICE: This message is intended for the use of the individual or entity
>> to
>> > which it is addressed and may contain information that is confidential,
>> > privileged and exempt from disclosure under applicable law. If the reader
>> > of this message is not the intended recipient, you are hereby notified
>> that
>> > any printing, copying, dissemination, distribution, disclosure or
>> > forwarding of this communication is strictly prohibited. If you have
>> > received this communication in error, please contact the sender
>> immediately
>> > and delete it from your system. Thank You.
>>
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.

Reply via email to