[ 
https://issues.apache.org/jira/browse/PIG-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188657#comment-13188657
 ] 

Jie Li commented on PIG-2423:
-----------------------------

Thanks, Thejas. For the moment I'll just paste them here. I've added two cases, 
and I'm wondering whether they can be made more general. Feel free to improve them.

{code}
1. Use COGROUP to do the join

When a GROUP BY and a JOIN use the same keys, we can usually combine them 
with COGROUP to reduce the number of MapReduce jobs. 

-- Query 1
A = load 'myfile' as (x, u, v);
B = load 'myotherfile' as (x, y, z);

t1 = group B by x;
t2 = foreach t1 generate group as x, COUNT(B.y) as count_y;
t3 = join A by x, t2 by x;

-- Query 2
A = load 'myfile' as (x, u, v);
B = load 'myotherfile' as (x, y, z);

t1 = cogroup A by x, B by x;
t2 = filter t1 by NOT IsEmpty(A) AND NOT IsEmpty(B); -- an inner join
t3 = foreach t2 generate group, COUNT(B.y);

While Query 1 requires two separate MR jobs, Query 2 requires only one 
MR job because it uses COGROUP.
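
-- Note: Query 2 emits one row per key, while the join in Query 1 emits one row
-- per matching tuple of A. If the per-tuple rows of A are needed, a possible
-- variant (an untested sketch; the alias t3_rows is made up for illustration)
-- flattens the A bag, reusing t2 from Query 2:
t3_rows = foreach t2 generate FLATTEN(A), COUNT(B.y) as count_y;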

2. Use GROUP+FLATTEN to do the self join

Sometimes we need a self join to attach some additional information to each row. 
For example, for each employee, find the average salary in his/her department.

-- Query 1
A = load 'myfile' as (name, salary, department);
t1 = group A by department;
t2 = foreach t1 generate group, AVG(A.salary) as avg_salary;
t3 = join A by department, t2 by group;

-- Query 2
A = load 'myfile' as (name, salary, department);
t1 = group A by department;
t2 = foreach t1 generate FLATTEN(A), AVG(A.salary) as avg_salary;

While Query 1 needs two MR jobs, Query 2 requires only one MR job because it 
uses FLATTEN after GROUP to implement the self join.
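
-- One possible way to check the job counts (a sketch; the plan output format
-- depends on the Pig version): run each query as its own script, end it with
-- EXPLAIN on the final alias, and compare the MapReduce plans that are printed.
explain t3;  -- at the end of Query 1, whose final alias is t3
explain t2;  -- at the end of Query 2, whose final alias is t2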
{code}
                
> document use case where co-group is better choice than join 
> ------------------------------------------------------------
>
>                 Key: PIG-2423
>                 URL: https://issues.apache.org/jira/browse/PIG-2423
>             Project: Pig
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Thejas M Nair
>             Fix For: 0.10
>
>
> Optimization rules 2 and 3 suggested in 
> https://issues.apache.org/jira/secure/attachment/12506841/pig_tpch.ppt 
> (PIG-2397) recommend the use of co-group instead of  join in certain cases. 
> These should be documented in pig performance page.


        
