[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

Ashutosh Chauhan (JIRA) Sun, 01 Nov 2009 17:37:25 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772404#action_12772404
 ]


Ashutosh Chauhan commented on PIG-1038:
---------------------------------------

I think it's a useful optimization. I presume this will be implemented as a 
visitor in MapReduceLauncher that visits the compiled MR plan. The design looks 
good. I have a few questions:

bq. 1.1 Discover if we use sort/distinct in nested foreach plan.
How are you planning to discover this? By matching some pattern, such as an LR 
in the map plan followed by POPackage, POForeach, POSort in the reduce plan?
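
To make the question concrete, here is a minimal sketch of the kind of pattern check I have in mind. It is only an illustration under loud assumptions: the plans are flattened to lists of operator-name strings, `find_pattern` and `REDUCE_PATTERNS` are hypothetical names, and the distinct variant of the pattern is my guess. None of this is Pig's actual plan API.

```python
from typing import List, Optional

# Hypothetical: operators are plain strings, not Pig's real physical operators.
# Assumed reduce-side patterns: package, nested foreach, then sort or distinct.
REDUCE_PATTERNS = [
    ["POPackage", "POForeach", "POSort"],
    ["POPackage", "POForeach", "PODistinct"],
]


def find_pattern(map_plan: List[str], reduce_plan: List[str]) -> Optional[List[str]]:
    """Return the matched reduce-side pattern, or None if the plans do not qualify."""
    # Assumption: the map plan must end in a local rearrange (LR).
    if not map_plan or map_plan[-1] != "POLocalRearrange":
        return None
    for pattern in REDUCE_PATTERNS:
        # Compare the leading operators of the reduce plan with the pattern.
        if reduce_plan[:len(pattern)] == pattern:
            return pattern
    return None
```

A real visitor would walk the nested plan inside the foreach rather than a flat list, so this only shows the shape of the check, not where it would live.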

This is somewhat orthogonal, but related to this issue. We have a rule-based 
optimizer framework in the front-end; it seems to me that a similar framework 
is needed in the backend too, both to refactor all the optimizer visitors we 
currently have and to make it easy to add similar optimizations in the future. 
There are seven optimizations in the front-end expressed through rules. On the 
other hand, after the addition of this one we will have nine optimization 
visitors in the backend. Maybe we can think about this to avoid a lot of rework 
every time such an optimization is added.

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>
> If a nested foreach plan contains sort/distinct, it is possible to use Hadoop 
> secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
> query. 
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct 
> D" to a special version of distinct, which does not do the sorting.
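
The two examples above can be sketched outside of Pig to show why a secondary sort removes the per-group sorting work. This is a minimal simulation, not Pig or Hadoop code: `shuffle`, `nested_order`, and `nested_distinct` are hypothetical names, and records are assumed to be `(group_key, value)` tuples standing in for `($0, $1)`.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(records):
    """Simulate a shuffle that sorts on (group key, secondary key)."""
    return sorted(records, key=itemgetter(0, 1))


def nested_order(records):
    """Eg1: each group's tuples arrive already ordered by $1, so no extra sort runs."""
    return {key: list(rows)
            for key, rows in groupby(shuffle(records), key=itemgetter(0))}


def nested_distinct(records):
    """Eg2: duplicates are adjacent, so comparing with the previous value is enough."""
    result = {}
    for key, rows in groupby(shuffle(records), key=itemgetter(0)):
        bag = []
        for _, value in rows:
            # Keep a value only if it differs from the last one kept.
            if not bag or bag[-1] != value:
                bag.append(value)
        result[key] = bag
    return result
```

For example, with `data = [("a", 3), ("b", 1), ("a", 1), ("a", 3), ("b", 1)]`, `nested_distinct(data)` gives `{"a": [1, 3], "b": [1]}`. The distinct step never sorts or hashes within a group; it relies entirely on the order the shuffle already provides.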

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
