[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772404#action_12772404 ]
Ashutosh Chauhan commented on PIG-1038:
---------------------------------------

I think it's a useful optimization. I presume this will be implemented as a visitor in MapReduceLauncher that visits the compiled MR plan. The design looks good. I have a few questions:

bq. 1.1 Discover if we use sort/distinct in nested foreach plan.

How are you planning to discover this? By matching some pattern, such as LR in the map plan followed by POPackage, POForeach, POSort in the reduce plan?

This is somewhat orthogonal, but related to this issue. We have a rule-based optimizer framework in the front-end, and it seems to me that a similar optimizer framework is needed in the backend too, both to refactor all the optimizer visitors we currently have and to make it easy to add similar optimizations in the future. There are seven optimizations in the front-end expressed through rules. On the other hand, after the addition of this one we will have nine optimization visitors in the backend. Maybe we can think about it, to avoid a lot of rework every time such an optimization is added.

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>
> If nested foreach plan contains sort/distinct, it is possible to use hadoop
> secondary sort instead of SortedDataBag and DistinctDataBag to optimize the
> query.
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct
> D" to a special version of distinct, which does not do the sorting.
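
To make the idea in Eg1 and Eg2 concrete, here is a minimal sketch in plain Python. It is not Pig or Hadoop code. It assumes records are (group key, value) pairs standing in for ($0, $1), and it uses an in-memory sort on the composite key as a stand-in for Hadoop's shuffle with a secondary sort. All function names are hypothetical and chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records):
    """Stand-in for the shuffle: sort on (group key, secondary key).

    Assumption: each record is a (group_key, value) pair, i.e. ($0, $1).
    """
    return sorted(records, key=itemgetter(0, 1))


def nested_order(records):
    """Eg1: each group's bag arrives already ordered by $1,
    so no per-group sort is performed here."""
    shuffled = shuffle_with_secondary_sort(records)
    return [(key, list(group))
            for key, group in groupby(shuffled, key=itemgetter(0))]


def nested_distinct(records):
    """Eg2: with $1 values arriving in sorted order, duplicates are
    adjacent, so distinct only compares each value with the previous one."""
    shuffled = shuffle_with_secondary_sort(records)
    result = []
    for key, group in groupby(shuffled, key=itemgetter(0)):
        distinct_values = []
        for _, value in group:
            if not distinct_values or distinct_values[-1] != value:
                distinct_values.append(value)
        result.append((key, distinct_values))
    return result


rows = [("b", 3), ("a", 2), ("a", 1), ("b", 3), ("a", 2)]
print(nested_order(rows))
print(nested_distinct(rows))
```

In this sketch the only sort is the one on the composite key. The per-group work is a single pass, which is what lets the nested "order" be dropped and the nested "distinct" skip its own sorting.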