Daniel Dai commented on PIG-1038:

A couple of limitations of the current implementation, which will be addressed:
1. If the sort is not nested inside a foreach plan, it will not be optimized.
That's the issue with merge join indexing which Ashutosh found.

2. All distinct keys are assumed to be sorted in ascending order. Actually, for
a distinct key the sort order is flexible: if a descending sort is cheaper, we
should use a descending sort. Eg:
    C = foreach B { C1 = order A by $0 desc; C2 = C1.$0; C3 = distinct C2; generate group, C3; }
    Both order by and distinct are on the same key A.$0; however, order by uses
descending order. If we used descending A.$0 as the secondary key, we would be
able to remove both order by and distinct. This is not the case now: we can only
remove order by and must leave distinct.

3. The main key has the same issue. The main key is the group key, so its sort
order does not matter. However, the current implementation assumes ascending
order. Eg:
    B = group A by (a0, a1);
    C = foreach B { C1 = order A by a0 desc; generate group, C1; }
    We use (a0, a1) as the main key, and the nested order by cannot be removed.
However, if we reverse the order of the main key, then we can remove the order
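
To make limitations 2 and 3 concrete, here is a rough Python simulation of the shuffle. It is only a sketch and not Pig or Hadoop code: the function name shuffle, its two flags, and the (main_key, secondary_key) record shape are all made up for illustration. It shows that when the secondary key is sorted in descending order, each group's values already arrive in the order a nested "order ... desc" would produce, and that flipping the main key direction changes only the order in which groups arrive, not their contents.

```python
from itertools import groupby


def shuffle(records, main_desc=False, secondary_desc=False):
    """Simulate sorting on (main key, secondary key) before the reduce phase.

    records: iterable of (main_key, secondary_key) pairs (hypothetical shape).
    Returns a list of (main_key, [secondary_key, ...]) in arrival order.
    """
    # Two stable sorts: secondary key first, then main key. This lets each
    # key pick its own direction independently.
    ordered = sorted(records, key=lambda r: r[1], reverse=secondary_desc)
    ordered.sort(key=lambda r: r[0], reverse=main_desc)
    # Consecutive records with the same main key form one group.
    return [
        (key, [value for _, value in rows])
        for key, rows in groupby(ordered, key=lambda r: r[0])
    ]


records = [("x", 2), ("y", 5), ("x", 3), ("x", 2), ("y", 1)]
asc_main = shuffle(records, secondary_desc=True)
desc_main = shuffle(records, main_desc=True, secondary_desc=True)
```

Here asc_main and desc_main contain the same groups with the same descending value lists; only the group arrival order differs, which is why the direction of the main key is free to choose.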

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>         Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, 
> PIG-1038-4.patch, PIG-1038-5.patch
> If a nested foreach plan contains sort/distinct, it is possible to use Hadoop 
> secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
> query. 
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct 
> D" to a special version of distinct, which does not do the sorting.
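
The "special version of distinct" in Eg2 can be pictured with a small Python sketch. This is an illustration under the assumption that the secondary sort has already placed equal values next to each other; distinct_sorted is a hypothetical name, not a Pig operator. Because duplicates are adjacent, one streaming pass that compares each value with the previous one is enough, and no sorting or in-memory bag is needed. The same pass works for ascending and descending input.

```python
_MISSING = object()  # sentinel meaning "no previous value yet"


def distinct_sorted(values):
    """Yield each value once, assuming equal values are adjacent in the input."""
    previous = _MISSING
    for value in values:
        # Emit only when the value differs from the one just before it.
        if previous is _MISSING or value != previous:
            yield value
            previous = value


descending_input = [3, 2, 2, 1, 1, 1]
unique_values = list(distinct_sorted(descending_input))
```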

