Re: Optimization opportunity for group by followed by join on the same key ?

Jeff Zhang Thu, 05 Mar 2015 18:12:08 -0800

Thanks Daniel & Rohini. I have updated PIG-3839, changed its title to
"Integrate YSmart into Pig on tez", and added more comments to it.
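
For readers skimming the thread, here is a rough sketch of why the extra
repartition looks avoidable for the script quoted further down. It is plain
Python, not Pig or Tez code. The partition count, the sample rows, and the
helper names (partition, group_count, join) are all made up for illustration.
It only shows that once both inputs are partitioned on x, grouping and joining
inside each partition gives the same rows as grouping globally and then
joining. It says nothing about how Pig on Tez actually builds its plan.

```python
from collections import Counter, defaultdict

NUM_PARTITIONS = 3  # hypothetical parallelism, chosen only for illustration


def partition(rows, num_partitions=NUM_PARTITIONS):
    """Bucket rows by their key (first field), like a hash partitioner would."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[0] % num_partitions].append(row)
    return buckets


def group_count(rows):
    """Rough analogue of: group a by x; generate group as x, COUNT($1) as cnt."""
    return sorted(Counter(x for x, _ in rows).items())


def join(left, right):
    """Inner equi-join on the first field of each row."""
    index = defaultdict(list)
    for row in right:
        index[row[0]].append(row)
    return [l + r for l in left for r in index[l[0]]]


# Made-up sample data standing in for 'pig/input' and 'pig/input1'.
a = [(1, "p"), (2, "q"), (1, "r"), (4, "s"), (5, "t")]
b = [(1, "u"), (4, "v"), (4, "w"), (6, "z")]

# Plan A: aggregate all of a, then join the result with b.
global_result = sorted(join(group_count(a), b))

# Plan B: partition a and b once on x, then group and join per partition.
a_parts, b_parts = partition(a), partition(b)
local_result = sorted(
    row
    for p in range(NUM_PARTITIONS)
    for row in join(group_count(a_parts[p]), b_parts[p])
)

# Both plans produce the same joined rows.
assert local_result == global_result
```

The point of the sketch is that every row with a given x lands in the same
partition of both inputs, so the per-partition counts are already final and
can be joined in place.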



On Fri, Mar 6, 2015 at 8:53 AM, Rohini Palaniswamy <[email protected]>
wrote:

> Jeff,
>    https://issues.apache.org/jira/browse/PIG-3839 is the umbrella jira
> for Tez performance. Please file anything you identify in it if it is
> not already there.
>
> Regards,
> Rohini
>
> On Thu, Mar 5, 2015 at 4:50 PM, Rohini Palaniswamy <
> [email protected]> wrote:
>
>> Jeff,
>>    There is already a JIRA -
>> https://issues.apache.org/jira/browse/PIG-3849. You can update it with
>> the details/diagrams.
>>
>> Regards,
>> Rohini
>>
>>
>> On Thu, Mar 5, 2015 at 9:41 AM, Daniel Dai <[email protected]> wrote:
>>
>>> Thanks Jeff. I think the mailing list does not allow attachments, but I get
>>> your point.
>>>
>>> Yes, and there are actually a couple more patterns like this: rank ->
>>> sort, join -> sort, sort -> distinct, etc. This certainly can be done, and
>>> it can be done in a more general way similar to YSmart (HIVE-2206). The
>>> question is the amount of work involved. Can you open a ticket to track it?
>>> I don't think there is one yet.
>>>
>>> Daniel
>>>
>>> From: Jeff Zhang <[email protected]<mailto:[email protected]>>
>>> Reply-To: "[email protected]<mailto:[email protected]>" <
>>> [email protected]<mailto:[email protected]>>
>>> Date: Thursday, March 5, 2015 at 6:30 AM
>>> To: "[email protected]<mailto:[email protected]>" <[email protected]
>>> <mailto:[email protected]>>
>>> Subject: Re: Optimization opportunity for group by followed by join on
>>> the same key ?
>>>
>>> Uploading the DAG diagram again (someone told me it is not visible)
>>> [Inline image 1]
>>>
>>> On Thu, Mar 5, 2015 at 10:28 PM, Jeff Zhang <[email protected]<mailto:
>>> [email protected]>> wrote:
>>> Thanks Rajesh, will upload it to dev mail list again.
>>>
>>> On Thu, Mar 5, 2015 at 10:22 PM, Rajesh Balamohan <
>>> [email protected]<mailto:[email protected]>> wrote:
>>> Works fine. Thank you. Not sure if it got trimmed by the dev mailing list.
>>> I didn't see this diagram on the mailing list and thought I should let
>>> you know.
>>>
>>> ~Rajesh.B
>>>
>>> On Thu, Mar 5, 2015 at 7:46 PM, Jeff Zhang <[email protected]<mailto:
>>> [email protected]>> wrote:
>>> Uploading the DAG diagram again; hope it works this time.
>>>
>>>
>>> [Inline image 1]
>>>
>>> On Thu, Mar 5, 2015 at 8:25 PM, Rajesh Balamohan <
>>> [email protected]<mailto:[email protected]>> wrote:
>>> Hey Jeff,
>>>
>>> The diagram isn't visible.  Can you please reattach the diagram?
>>>
>>> ~Rajesh.B
>>>
>>> On Thu, Mar 5, 2015 at 3:06 PM, Jeff Zhang <[email protected]<mailto:
>>> [email protected]>> wrote:
>>> Hi folks,
>>>
>>> Here's my pig script:
>>>
>>>
>>>     a = load 'pig/input' as (x:int, y:chararray);
>>>
>>>     b = load 'pig/input1' as (x:int, y:chararray);
>>>
>>>     c = group a by x;
>>>
>>>     d = foreach c generate group as x, COUNT($1) as cnt;
>>>
>>>     d = join d by x, b by x;
>>>
>>>     store d into 'pig/output';
>>>
>>>
>>> I use Tez as the execution engine and notice that Pig converts the script
>>> into one DAG with 4 vertices, as follows. I think 3 vertices should be
>>> sufficient, because the group by and the join use the same key.
>>>
>>> So I think vertex (scope_39) is not necessary: we don't need to
>>> repartition the data again. The only impact of going from 4 vertices to 3
>>> may be on the parallelism of vertex (scope_41). I'm not sure how large
>>> the performance difference between these 2 approaches is, but I think
>>> this could be a potential optimization.
>>>
>>>
>>>
>>>
>>>
>>> [Inline image 1]
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>>
>>>
>>> --
>>> ~Rajesh.B
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>>
>>>
>>> --
>>> ~Rajesh.B
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>


-- 
Best Regards

Jeff Zhang
