Thanks Jeff. I think the mailing list does not allow attachments, but I get 
your point.

Yes, and there are actually a couple more patterns like this: rank -> sort, 
join -> sort, sort -> distinct, etc. This can certainly be done, and it can be 
done in a more general way, similar to YSmart (HIVE-2206). The question is the 
amount of work involved. Can you open a ticket to track it? I don't think there 
is one yet.

Daniel

From: Jeff Zhang <[email protected]>
Reply-To: [email protected]
Date: Thursday, March 5, 2015 at 6:30 AM
To: [email protected]
Subject: Re: Optimization opportunity for group by followed by join on the same 
key ?

Uploading the DAG diagram again (someone told me it is not visible).
[Inline image 1]

On Thu, Mar 5, 2015 at 10:28 PM, Jeff Zhang <[email protected]> wrote:
Thanks Rajesh, I will upload it to the dev mailing list again.

On Thu, Mar 5, 2015 at 10:22 PM, Rajesh Balamohan <[email protected]> wrote:
Works fine. Thank you. Not sure if it got trimmed by the dev mailing list. I 
didn't see the diagram on the mailing list and thought I'd let you know.

~Rajesh.B

On Thu, Mar 5, 2015 at 7:46 PM, Jeff Zhang <[email protected]> wrote:
Uploading the DAG diagram again; hope it works this time.


[Inline image 1]

On Thu, Mar 5, 2015 at 8:25 PM, Rajesh Balamohan <[email protected]> wrote:
Hey Jeff,

The diagram isn't visible.  Can you please reattach the diagram?

~Rajesh.B

On Thu, Mar 5, 2015 at 3:06 PM, Jeff Zhang <[email protected]> wrote:
Hi folks,

Here's my Pig script:


    a = load 'pig/input' as (x:int, y:chararray);

    b = load 'pig/input1' as (x:int, y:chararray);

    c = group a by x;

    d = foreach c generate group as x, COUNT($1) as cnt;

    d = join d by x, b by x;

    store d into 'pig/output';


I use Tez as the execution engine and noticed that Pig converts the script 
into one DAG with 4 vertices, as shown in the diagram below. But I think 3 
vertices should be sufficient, because the group by and the join use the same 
key.

So I think vertex (scope_39) is not necessary: we don't need to repartition 
the data again. The only impact of converting 4 vertices to 3 may be on the 
parallelism of vertex (scope_41). I'm not sure how large the performance 
difference between these two plans is, but I think this could be a potential 
optimization.
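
The argument above can be illustrated with a small sketch. This is not Pig or 
Tez code; it only models a shuffle as hash partitioning on the key, and every 
function name in it (hash_partition, group_count, join_partition, 
four_vertex_plan, three_vertex_plan) is hypothetical. It assumes the group by 
and the join use the same partitioner and the same number of partitions, which 
is where the parallelism concern comes from.

```python
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Model a shuffle: route each (key, value) row to a partition by key hash."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def group_count(partition):
    """Model `group a by x` + COUNT within one partition."""
    counts = defaultdict(int)
    for key, _ in partition:
        counts[key] += 1
    return dict(counts)


def join_partition(counts, b_rows):
    """Model `join d by x, b by x` within one partition: (d.x, d.cnt, b.x, b.y)."""
    return [(key, counts[key], key, y) for key, y in b_rows if key in counts]


def four_vertex_plan(a, b, n):
    """Group by, then shuffle the grouped output again before the join."""
    d_rows = []
    for part in hash_partition(a, n):
        d_rows.extend(group_count(part).items())
    d_parts = hash_partition(d_rows, n)  # the extra repartition step
    b_parts = hash_partition(b, n)
    out = []
    for d_part, b_part in zip(d_parts, b_parts):
        out.extend(join_partition(dict(d_part), b_part))
    return sorted(out)


def three_vertex_plan(a, b, n):
    """Join directly in the partition that did the group by (no second shuffle)."""
    a_parts = hash_partition(a, n)
    b_parts = hash_partition(b, n)  # b must use the same n as the group by
    out = []
    for a_part, b_part in zip(a_parts, b_parts):
        out.extend(join_partition(group_count(a_part), b_part))
    return sorted(out)


a = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
b = [(1, "p"), (2, "q"), (4, "r")]
print(three_vertex_plan(a, b, 3) == four_vertex_plan(a, b, 3))
```

Because the grouped rows already sit in the partition chosen by their key, the 
second shuffle in four_vertex_plan sends each of them back to the same 
partition, so both plans produce the same rows. The cost of skipping it is 
that the join can no longer pick its own number of partitions.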





[Inline image 1]



--
Best Regards

Jeff Zhang



--
~Rajesh.B