[
https://issues.apache.org/jira/browse/PIG-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047267#comment-13047267
]
Thejas M Nair commented on PIG-1846:
------------------------------------
The optimizations proposed above is applicable for only cases where the
distinct happens on only one column or a single set of columns .
for example, it is applicable for-
res = FOREACH gby GENERATE group.c1, group.c2, FUNC(distinct in.c3);
res = FOREACH gby GENERATE group.c1, group.c2, FUNC1(distinct in.c3),
FUNC2(distinct in.c3); -- distinct on same column used in two functions
res = FOREACH gby GENERATE group.c1, group.c2, FUNC(distinct in.(c3,c4)); --
distinct on multiple columns
res = FOREACH gby GENERATE group.c1, group.c2, FUNC1(distinct in.(c3,c4)),
FUNC2(distinct in.(c3,c4)); -- distinct on same set of multiple columns, used
in two functions
It is not applicable for -
res = FOREACH gby GENERATE group.c1, group.c2, FUNC(distinct in.c3),
FUNC(distinct in.c4); -- the two udfs have distinct on two different udfs.
FYI, the examples here also using unsupported syntax -
res = FOREACH gby GENERATE group.c1, group.c2, FUNC(DISTINCT in.c3);
should actually be -
res = FOREACH gby { dist_c3 = DISTINCT in.c3; GENERATE group.c1, group.c2,
FUNC(dist_c3);}
> optimize queries like - count distinct users for each gender
> ------------------------------------------------------------
>
> Key: PIG-1846
> URL: https://issues.apache.org/jira/browse/PIG-1846
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Thejas M Nair
> Fix For: 0.10
>
>
> The pig group operation does not usually have to deal with skew on the
> group-by keys if the foreach statement that works on the results of group has
> only algebraic functions on the bags. But for some queries like the
> following, skew can be a problem -
> {code}
> user_data = load 'file' as (user, gender, age);
> user_group_gender = group user_data by gender parallel 100;
> dist_users_per_gender = foreach user_group_gender
> {
> dist_user = distinct user_data.user;
> generate group as gender, COUNT(dist_user) as
> user_count;
> }
> {code}
> Since there are only 2 distinct values of the group-by key, only 2 reducers
> will actually get used in current implementation. ie, you can't get better
> performance by adding more reducers.
> Similar problem is there when the data is skewed on the group key. With
> current implementation, another problem is that pig and MR has to deal with
> records with extremely large bags that have the large number of distinct user
> names, which results in high memory utilization and having to spill the bags
> to disk.
> The query plan should be modified to handle the skew in such cases and make
> use of more reducers.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira