Re: Optimization opportunity for group by followed by join on the same key ?

Rohini Palaniswamy Thu, 05 Mar 2015 16:53:17 -0800

Jeff,
   There is already a JIRA - https://issues.apache.org/jira/browse/PIG-3849.
You can update it with the details/diagrams.


Regards,
Rohini

On Thu, Mar 5, 2015 at 9:41 AM, Daniel Dai <[email protected]> wrote:

> Thanks Jeff. I think the mailing list does not allow attachments, but I get
> your point.
>
> Yes, and there are actually a couple more patterns like this: rank ->
> sort, join -> sort, sort -> distinct, etc. This certainly can be done, and
> it can be done in a more general way similar to YSmart (HIVE-2206). The
> question is the amount of work involved. Can you open a ticket to track it?
> I don't think there is one yet.
>
> Daniel
>
> From: Jeff Zhang <[email protected]<mailto:[email protected]>>
> Reply-To: "[email protected]<mailto:[email protected]>" <
> [email protected]<mailto:[email protected]>>
> Date: Thursday, March 5, 2015 at 6:30 AM
> To: "[email protected]<mailto:[email protected]>" <[email protected]
> <mailto:[email protected]>>
> Subject: Re: Optimization opportunity for group by followed by join on the
> same key ?
>
> Uploading the DAG diagram again (someone told me it is not visible)
> [Inline image 1]
>
> On Thu, Mar 5, 2015 at 10:28 PM, Jeff Zhang <[email protected]<mailto:
> [email protected]>> wrote:
> Thanks Rajesh, will upload it to dev mail list again.
>
> On Thu, Mar 5, 2015 at 10:22 PM, Rajesh Balamohan <
> [email protected]<mailto:[email protected]>> wrote:
> Works fine.  Thank you. Not sure if it got trimmed by dev mailing list.  I
> didn't see this diagram from the mailing list and thought of informing you.
>
> ~Rajesh.B
>
> On Thu, Mar 5, 2015 at 7:46 PM, Jeff Zhang <[email protected]<mailto:
> [email protected]>> wrote:
> Uploading the DAG diagram again, hope it works this time
>
>
> [Inline image 1]
>
> On Thu, Mar 5, 2015 at 8:25 PM, Rajesh Balamohan <
> [email protected]<mailto:[email protected]>> wrote:
> Hey Jeff,
>
> The diagram isn't visible.  Can you please reattach the diagram?
>
> ~Rajesh.B
>
> On Thu, Mar 5, 2015 at 3:06 PM, Jeff Zhang <[email protected]<mailto:
> [email protected]>> wrote:
> Hi folks,
>
> Here's my pig script:
>
>
>     a = load 'pig/input' as (x:int, y:chararray);
>
>     b = load 'pig/input1' as (x:int, y:chararray);
>
>     c = group a by x;
>
>     d = foreach c generate group as x, COUNT($1) as cnt;
>
>     d = join d by x, b by x;
>
>     store d into 'pig/output';
>
>
> I use Tez as the execution engine and noticed that Pig converts this script
> into one DAG with 4 vertices, as shown below. But I think 3 vertices should
> be sufficient, because the group by and the join use the same key.
>
> So I think vertex (scope_39) is not necessary; we don't need to repartition
> the data again. The only impact of going from 4 vertices to 3 may be on the
> parallelism of vertex (scope_41). I'm not sure how large the performance
> difference between these 2 approaches is, but I think this could be a
> potential optimization.
>
>
>
>
>
> [Inline image 1]
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>
> --
> ~Rajesh.B
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>
> --
> ~Rajesh.B
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
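
For readers without the diagram, the reasoning in Jeff's original message can be sketched outside of Pig. The Python below is only an illustration, not Pig or Tez code: the function names, the sample rows, and the partition count are all hypothetical, and row counting stands in for COUNT($1). It hash-partitions both inputs on x once, then does the count and the join inside each partition, with no second repartition between the two steps.

```python
from collections import defaultdict


def partition_by_key(rows, num_partitions):
    """Simulate a single shuffle: route each (x, y) row by the hash of x."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[0]) % num_partitions].append(row)
    return parts


def count_per_key(part):
    """Stand-in for 'group a by x' followed by COUNT, within one partition."""
    counts = defaultdict(int)
    for x, _y in part:
        counts[x] += 1
    return dict(counts)


def group_then_join(a, b, num_partitions=3):
    """Group a by x, count, and join with b on x using one shuffle per input."""
    a_parts = partition_by_key(a, num_partitions)
    b_parts = partition_by_key(b, num_partitions)
    joined = []
    for a_part, b_part in zip(a_parts, b_parts):
        counts = count_per_key(a_part)
        # Every row of b with key x sits in the same partition as the
        # count for x, so the join needs no further data movement.
        for bx, by in b_part:
            if bx in counts:
                joined.append((bx, counts[bx], bx, by))
    return joined


if __name__ == "__main__":
    # Hypothetical sample data with the same (x:int, y:chararray) shape.
    a = [(1, "p"), (1, "q"), (2, "r"), (4, "s")]
    b = [(1, "u"), (2, "v"), (3, "w")]
    print(sorted(group_then_join(a, b)))
```

The sketch also shows the trade-off Jeff mentions: the join runs with whatever parallelism the group by was given, since both steps share one partitioning.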
