Thanks Jeff. I think the mailing list does not allow attachments, but I get your point.
Yes, and there are actually a couple more patterns like this: rank -> sort, join -> sort, sort -> distinct, etc. This certainly can be done, and it can be done in a more general way, similar to YSmart (HIVE-2206). The question is the amount of work involved. Can you open a ticket to track it? I don't think there is one yet.

Daniel

From: Jeff Zhang <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, March 5, 2015 at 6:30 AM
To: "[email protected]" <[email protected]>
Subject: Re: Optimization opportunity for group by followed by join on the same key ?

Upload dag diagram again (someone told me it is not visible)
[Inline image 1]

On Thu, Mar 5, 2015 at 10:28 PM, Jeff Zhang <[email protected]> wrote:
Thanks Rajesh, will upload it to dev mail list again.

On Thu, Mar 5, 2015 at 10:22 PM, Rajesh Balamohan <[email protected]> wrote:
Works fine. Thank you. Not sure if it got trimmed by dev mailing list. I didn't see this diagram from the mailing list and thought of informing you.
~Rajesh.B

On Thu, Mar 5, 2015 at 7:46 PM, Jeff Zhang <[email protected]> wrote:
upload the dag diagram again, hope it works this time
[Inline image 1]

On Thu, Mar 5, 2015 at 8:25 PM, Rajesh Balamohan <[email protected]> wrote:
Hey Jeff,
The diagram isn't visible. Can you please reattach the diagram?
~Rajesh.B

On Thu, Mar 5, 2015 at 3:06 PM, Jeff Zhang <[email protected]> wrote:
Hi folks,

Here's my pig script:

a = load 'pig/input' as (x:int, y:chararray);
b = load 'pig/input1' as (x:int, y:chararray);
c = group a by x;
d = foreach c generate group as x, COUNT($1) as cnt;
d = join d by x, b by x;
store d into 'pig/output';

I use Tez as the execution engine and notice that Pig converts it to one DAG with 4 vertices, as shown below. But I think 3 vertices should be sufficient: because the group by and the join use the same key, vertex (scope_39) is not necessary, and we don't need to repartition the data again. The only impact of converting 4 vertices to 3 may be on the parallelism of vertex (scope_41). I'm not sure how large the performance difference between these 2 methods is, but I think this could be a potential optimization.

[Inline image 1]

--
Best Regards

Jeff Zhang
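
The argument in the original message, that a group by followed by a join on the same key does not need a second repartition, can be illustrated with a small sketch. This is plain Python, not Pig or Tez code, and every name in it (`partition`, `group_count`, `join_local`, `group_then_join`) is hypothetical. It assumes hash partitioning on x and that both inputs use the same number of partitions.

```python
from collections import Counter, defaultdict


def partition(rows, key, num_parts):
    """Hash-partition rows by key(row); stands in for one shuffle edge."""
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[hash(key(row)) % num_parts].append(row)
    return parts


def group_count(part):
    """Per-partition analogue of: foreach (group a by x) generate group as x, COUNT(a)."""
    return sorted(Counter(x for x, _ in part).items())


def join_local(counts, b_part):
    """Join (x, cnt) rows with b rows on x, inside one partition only."""
    b_by_x = defaultdict(list)
    for x, y in b_part:
        b_by_x[x].append(y)
    return [(x, cnt, x, y) for x, cnt in counts for y in b_by_x.get(x, [])]


def group_then_join(a, b, num_parts=3):
    """Group a by x, count, then join with b on x using only two shuffles."""
    a_parts = partition(a, lambda r: r[0], num_parts)  # shuffle of a on x
    b_parts = partition(b, lambda r: r[0], num_parts)  # shuffle of b on x
    out = []
    for a_part, b_part in zip(a_parts, b_parts):
        # The counts are already partitioned by x, so the join consumes them
        # in place; there is no third shuffle of the aggregated rows.
        out.extend(join_local(group_count(a_part), b_part))
    return sorted(out)


if __name__ == "__main__":
    a = [(1, "p"), (2, "q"), (1, "r"), (3, "s")]
    b = [(1, "u"), (3, "v"), (4, "w"), (1, "z")]
    print(group_then_join(a, b))
```

Note that the join here necessarily runs with the same number of partitions as the group by, which corresponds to the parallelism concern raised above for the merged vertex.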
