Thanks Daniel & Rohini, I have updated PIG-3839, changed its title to "Integrate YSmart into Pig on tez", and added more comments to it.
On Fri, Mar 6, 2015 at 8:53 AM, Rohini Palaniswamy <[email protected]> wrote:

> Jeff,
>    https://issues.apache.org/jira/browse/PIG-3839 is the umbrella jira
> for Tez performance. Please file anything you identify in it if it is
> not already there.
>
> Regards,
> Rohini
>
> On Thu, Mar 5, 2015 at 4:50 PM, Rohini Palaniswamy <[email protected]> wrote:
>
>> Jeff,
>>    There is already a JIRA -
>> https://issues.apache.org/jira/browse/PIG-3849. You can update it with
>> the details/diagrams.
>>
>> Regards,
>> Rohini
>>
>> On Thu, Mar 5, 2015 at 9:41 AM, Daniel Dai <[email protected]> wrote:
>>
>>> Thanks Jeff. I think the mailing list does not allow attachments, but I
>>> get your point.
>>>
>>> Yes, and there are actually a couple more patterns like this: rank ->
>>> sort, join -> sort, sort -> distinct, etc. This certainly can be done, and
>>> it can be done in a more general way similar to YSmart (HIVE-2206). The
>>> question is the amount of work involved. Can you open a ticket to track it?
>>> I don't think there is one yet.
>>>
>>> Daniel
>>>
>>> From: Jeff Zhang <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Thursday, March 5, 2015 at 6:30 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Optimization opportunity for group by followed by join on
>>> the same key ?
>>>
>>> Uploading the DAG diagram again (someone told me it is not visible)
>>> [Inline image 1]
>>>
>>> On Thu, Mar 5, 2015 at 10:28 PM, Jeff Zhang <[email protected]> wrote:
>>> Thanks Rajesh, will upload it to the dev mailing list again.
>>>
>>> On Thu, Mar 5, 2015 at 10:22 PM, Rajesh Balamohan <[email protected]> wrote:
>>> Works fine. Thank you. Not sure if it got trimmed by the dev mailing list.
>>> I didn't see this diagram on the mailing list and thought of informing
>>> you.
>>>
>>> ~Rajesh.B
>>>
>>> On Thu, Mar 5, 2015 at 7:46 PM, Jeff Zhang <[email protected]> wrote:
>>> Uploading the DAG diagram again, hope it works this time
>>>
>>> [Inline image 1]
>>>
>>> On Thu, Mar 5, 2015 at 8:25 PM, Rajesh Balamohan <[email protected]> wrote:
>>> Hey Jeff,
>>>
>>> The diagram isn't visible. Can you please reattach the diagram?
>>>
>>> ~Rajesh.B
>>>
>>> On Thu, Mar 5, 2015 at 3:06 PM, Jeff Zhang <[email protected]> wrote:
>>> Hi folks,
>>>
>>> Here's my pig script:
>>>
>>> a = load 'pig/input' as (x:int, y:chararray);
>>> b = load 'pig/input1' as (x:int, y:chararray);
>>> c = group a by x;
>>> d = foreach c generate group as x, COUNT($1) as cnt;
>>> d = join d by x, b by x;
>>> store d into 'pig/output';
>>>
>>> I use Tez as the execution engine and noticed that Pig converts it
>>> to one DAG with 4 vertices, as shown below. But I think 3 vertices should
>>> be sufficient, because the group by and the join use the same key.
>>>
>>> So I think vertex (scope_39) is not necessary: we don't need to
>>> repartition the data again. The only impact of converting 4 vertices to 3
>>> vertices may be on the parallelism of vertex (scope_41). I'm not sure how
>>> large the performance difference between these 2 methods is, but I think
>>> this could be a potential optimization.
>>>
>>> [Inline image 1]
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang

--
Best Regards

Jeff Zhang
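
To make the idea in the original message above concrete, here is a small Python sketch (not Pig or Tez code) of why a second repartition on x should not change the result when the group by and the join use the same key. Everything in it is an assumption for illustration: NUM_PARTITIONS, the function names, and the use of Python's hash() as a stand-in for a shuffle partitioner are all hypothetical and are not taken from the Pig code base. It says nothing about performance, only about equivalence of the output.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical parallelism, not taken from the thread


def partition(rows, key_index, n=NUM_PARTITIONS):
    """Hash-partition rows on one field, roughly what a shuffle edge does."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key_index]) % n].append(row)
    return parts


def group_count(rows):
    """Mimics: c = group a by x; d = foreach c generate group as x, COUNT($1) as cnt;"""
    counts = defaultdict(int)
    for x, _y in rows:
        counts[x] += 1
    return sorted(counts.items())


def join_on_x(left, right):
    """Inner join of (x, cnt) rows with (x, y) rows on x."""
    index = defaultdict(list)
    for x, y in right:
        index[x].append(y)
    return [(x, cnt, x, y) for x, cnt in left for y in index[x]]


def run_three_stage(a, b):
    """a and b are each shuffled once on x; group and join share a partition."""
    a_parts, b_parts = partition(a, 0), partition(b, 0)
    out = []
    for p in range(NUM_PARTITIONS):
        out.extend(join_on_x(group_count(a_parts[p]), b_parts[p]))
    return sorted(out)


def run_four_stage(a, b):
    """Group first, then reshuffle the grouped rows on x before the join."""
    a_parts = partition(a, 0)
    grouped = [row for p in range(NUM_PARTITIONS) for row in group_count(a_parts[p])]
    d_parts, b_parts = partition(grouped, 0), partition(b, 0)
    out = []
    for p in range(NUM_PARTITIONS):
        out.extend(join_on_x(d_parts[p], b_parts[p]))
    return sorted(out)
```

Because the grouped rows keep the same key x, the second call to partition() in run_four_stage sends every row back to the partition it was already in, so both functions return the same joined rows. The sketch does assume both inputs are partitioned with the same function and the same number of partitions, which is why merging the vertices could affect parallelism as noted in the message.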
