[ 
https://issues.apache.org/jira/browse/PIG-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047231#comment-13047231
 ] 

Thejas M Nair commented on PIG-1846:
------------------------------------

bq. The DISTINCT optimization is often not applicable; consider, for example, a 
script that takes all pages on a website and generates COUNT(impressions), 
COUNT(distinct users). Doing the distinct operation first means we can no 
longer do COUNT(impressions). 
Yes, that optimization will not be applicable for this use case.

The translation you proposed helps to distribute the work of computing 
ALGFUNC(in.c4) across multiple tasks (even when there is skew on c1,c2). But 
FUNC(res_dist.c3) will still be computed on the reduce side (i.e., all records 
for a given value of c1,c2 will go to one reducer), because the combiner will 
not be used. This is because ALGFUNC$Final is not algebraic.

One cumbersome workaround is for the user to write a new UDF, ALGFUNC_2, which 
is the same as ALGFUNC except that ALGFUNC_2$Initial is the same as 
ALGFUNC$Intermed. ALGFUNC_2 is then used in the last foreach.
Pig can automate this logic and use the combiner for the last foreach in the 
translation of the above example.
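
To make the workaround concrete, here is a rough plain-Java model of the idea, 
with COUNT standing in for ALGFUNC. It is only a sketch: it does not use Pig's 
UDF interfaces, and the class and method names (AlgFunc2Sketch, firstJobTask, 
lastForeach, and so on) are made up for illustration. The point it tries to 
show is that the last foreach receives partial results rather than raw 
records, so its map-side stage has to behave like ALGFUNC$Intermed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the algebraic stages. Hypothetical names; not Pig's API.
class AlgFunc2Sketch {

    // ALGFUNC$Initial (modelled on COUNT): one raw record becomes a partial count of 1.
    static long algFuncInitial(String rawRecord) {
        return 1L;
    }

    // ALGFUNC$Intermed: merges several partial counts into one partial count.
    static long algFuncIntermed(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // ALGFUNC$Final: merges partial counts into the final result.
    static long algFuncFinal(List<Long> partials) {
        return algFuncIntermed(partials);
    }

    // ALGFUNC_2$Initial: same as ALGFUNC$Intermed, because its input is
    // already a partial result from the earlier job, not a raw record.
    static long algFunc2Initial(long partial) {
        return algFuncIntermed(Arrays.asList(partial));
    }

    // Earlier job: one task turns its share of raw records into one partial result.
    static long firstJobTask(List<String> rawRecords) {
        List<Long> partials = new ArrayList<>();
        for (String r : rawRecords) {
            partials.add(algFuncInitial(r));
        }
        return algFuncIntermed(partials);
    }

    // Last foreach using ALGFUNC_2: Initial on the map side, Intermed in the
    // combiner (once per map task), Final in the reducer.
    static long lastForeach(List<List<Long>> partialsPerMapTask) {
        List<Long> combinerOutputs = new ArrayList<>();
        for (List<Long> mapInput : partialsPerMapTask) {
            List<Long> afterInitial = new ArrayList<>();
            for (long p : mapInput) {
                afterInitial.add(algFunc2Initial(p));
            }
            combinerOutputs.add(algFuncIntermed(afterInitial));
        }
        return algFuncFinal(combinerOutputs);
    }

    public static void main(String[] args) {
        long p1 = firstJobTask(Arrays.asList("a", "b", "c"));
        long p2 = firstJobTask(Arrays.asList("d", "e"));
        long p3 = firstJobTask(Arrays.asList("f"));
        // Two map tasks in the last job; the reducer only sees two combined values.
        long total = lastForeach(Arrays.asList(Arrays.asList(p1, p2), Arrays.asList(p3)));
        System.out.println(total);
    }
}
```

If ALGFUNC$Initial were applied to the partial results instead, each partial 
would be counted as a single record and the total would be wrong, which is 
why ALGFUNC itself cannot simply be reused in the last foreach.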





> optimize queries like - count distinct users for each gender
> ------------------------------------------------------------
>
>                 Key: PIG-1846
>                 URL: https://issues.apache.org/jira/browse/PIG-1846
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Thejas M Nair
>             Fix For: 0.10
>
>
> The Pig group operation usually does not have to deal with skew on the 
> group-by keys if the foreach statement that works on the results of the group 
> has only algebraic functions on the bags. But for some queries, like the 
> following, skew can be a problem:
> {code}
> user_data = load 'file' as (user, gender, age);
> user_group_gender = group user_data by gender parallel 100;
> dist_users_per_gender = foreach user_group_gender 
>                         { 
>                              dist_user = distinct user_data.user; 
>                              generate group as gender, COUNT(dist_user) as user_count;
>                         }
> {code}
> Since there are only 2 distinct values of the group-by key, only 2 reducers 
> will actually be used in the current implementation, i.e., you can't get 
> better performance by adding more reducers.
> A similar problem arises when the data is skewed on the group key. With the 
> current implementation, another problem is that Pig and MR have to deal with 
> records with extremely large bags that hold a large number of distinct user 
> names, which results in high memory utilization and having to spill the bags 
> to disk.
> The query plan should be modified to handle the skew in such cases and make 
> use of more reducers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
