Jeff,

There is already a JIRA - https://issues.apache.org/jira/browse/PIG-3849. You can update it with the details/diagrams.
Regards,
Rohini

On Thu, Mar 5, 2015 at 9:41 AM, Daniel Dai <[email protected]> wrote:

> Thanks Jeff. I think the mailing list does not allow attachments, but I get
> your point.
>
> Yes, and there are actually a couple more patterns like this: rank ->
> sort, join -> sort, sort -> distinct, etc. This certainly can be done, and
> it can be done in a more general way, similar to YSmart (HIVE-2206). The
> question is the amount of work involved. Can you open a ticket to track it?
> I don't think there is one yet.
>
> Daniel
>
> From: Jeff Zhang <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, March 5, 2015 at 6:30 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Optimization opportunity for group by followed by join on the
> same key ?
>
> Upload dag diagram again (someone told me it is not visible)
> [Inline image 1]
>
> On Thu, Mar 5, 2015 at 10:28 PM, Jeff Zhang <[email protected]> wrote:
> Thanks Rajesh, will upload it to dev mail list again.
>
> On Thu, Mar 5, 2015 at 10:22 PM, Rajesh Balamohan <[email protected]> wrote:
> Works fine. Thank you. Not sure if it got trimmed by dev mailing list. I
> didn't see this diagram from the mailing list and thought of informing you.
>
> ~Rajesh.B
>
> On Thu, Mar 5, 2015 at 7:46 PM, Jeff Zhang <[email protected]> wrote:
> upload the dag diagram again, hope it works this time
>
> [Inline image 1]
>
> On Thu, Mar 5, 2015 at 8:25 PM, Rajesh Balamohan <[email protected]> wrote:
> Hey Jeff,
>
> The diagram isn't visible. Can you please reattach the diagram?
>
> ~Rajesh.B
>
> On Thu, Mar 5, 2015 at 3:06 PM, Jeff Zhang <[email protected]> wrote:
> Hi folks,
>
> Here's my pig script:
>
> a = load 'pig/input' as (x:int, y:chararray);
> b = load 'pig/input1' as (x:int, y:chararray);
> c = group a by x;
> d = foreach c generate group as x, COUNT($1) as cnt;
> d = join d by x, b by x;
> store d into 'pig/output';
>
> I use Tez as the execution engine and notice that Pig converts it to
> one DAG with 4 vertices, as shown below. But I think 3 vertices should be
> sufficient, because the group by and the join use the same key.
>
> So I think vertex (scope_39) is not necessary; we don't need to repartition
> the data again. The only impact of converting 4 vertices to 3 vertices may
> be on the parallelism of vertex (scope_41). I am not sure how large the
> performance difference between these 2 methods is, but I think this could
> be a potential optimization.
>
> [Inline image 1]
>
> --
> Best Regards
>
> Jeff Zhang
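To make the argument in the original message concrete, here is a minimal Python sketch of the idea. It is not Pig or Tez code. It assumes a toy modulo partitioner and small in-memory inputs, and every name in it (`NUM_PARTITIONS`, `partition_of`, `shuffle`, `group_count_then_join`) is hypothetical. It only illustrates that once both inputs are partitioned by `x`, the per-key count and the join on `x` can be computed in the same partition without a second repartition of the grouped data.

```python
from collections import Counter, defaultdict

NUM_PARTITIONS = 3  # hypothetical parallelism of the shuffle


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Toy partitioner: decide which partition an integer key is routed to."""
    return key % num_partitions


def shuffle(rows, num_partitions=NUM_PARTITIONS):
    """Route (x, y) rows to partitions by key x, like a single shuffle edge."""
    parts = defaultdict(list)
    for x, y in rows:
        parts[partition_of(x, num_partitions)].append((x, y))
    return parts


def group_count_then_join(a_rows, b_rows):
    """Count rows of a per key x and join with b on x, one shuffle per input."""
    a_parts = shuffle(a_rows)
    b_parts = shuffle(b_rows)
    out = []
    for p in range(NUM_PARTITIONS):
        # "group a by x" + COUNT, computed locally inside partition p.
        counts = Counter(x for x, _ in a_parts.get(p, []))
        # All rows of b with the same x were routed to partition p as well,
        # so the join needs no further repartitioning of the counts.
        for x, y in b_parts.get(p, []):
            if x in counts:
                out.append((x, counts[x], x, y))
    return sorted(out)


if __name__ == "__main__":
    a = [(1, "a"), (1, "b"), (2, "c"), (4, "d")]
    b = [(1, "p"), (2, "q"), (3, "r"), (4, "s"), (4, "t")]
    for row in group_count_then_join(a, b):
        print(row)
```

In this sketch the number of partitions is shared by the aggregation and the join, which mirrors the concern raised above about the parallelism of the merged vertex.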
