[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated PIG-1038:
----------------------------

    Status: Open  (was: Patch Available)

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
> Key: PIG-1038
> URL: https://issues.apache.org/jira/browse/PIG-1038
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.4.0
> Reporter: Olga Natkovich
> Assignee: Daniel Dai
> Fix For: 0.6.0
>
> Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, PIG-1038-4.patch
>
>
> If a nested foreach plan contains a sort or distinct, it is possible to use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.
>
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
>
> We can specify a secondary sort on A.$1, and drop "order A by $1".
>
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
>
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct D" to a special version of distinct, which does not do the sorting.
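
To illustrate the idea behind both examples, here is a minimal Python sketch that simulates the effect of a secondary sort in memory. It is not Pig's or Hadoop's actual code: the function names (shuffle_with_secondary_sort, nested_order, nested_distinct) and the sample data are hypothetical, and a plain sorted() call stands in for the shuffle. The point it shows is that once records reach the reducer ordered by (group key, secondary key), the nested "order" needs no extra work, and the nested "distinct" only has to compare each value with the previous one.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records):
    """Stand-in for the shuffle: sort on the composite (group key, $1)."""
    return sorted(records, key=itemgetter(0, 1))


def nested_order(records):
    """Eg1: 'order A by $1' is dropped; each group's rows arrive sorted."""
    shuffled = shuffle_with_secondary_sort(records)
    # Grouping looks only at the primary key ($0), like a grouping comparator.
    return [(key, list(rows)) for key, rows in groupby(shuffled, key=itemgetter(0))]


def nested_distinct(records):
    """Eg2: distinct without sorting; drop values equal to the previous one."""
    shuffled = shuffle_with_secondary_sort(records)
    result = []
    for key, rows in groupby(shuffled, key=itemgetter(0)):
        distinct = []
        for _, value in rows:
            # Duplicates are adjacent because the input is already sorted.
            if not distinct or distinct[-1] != value:
                distinct.append(value)
        result.append((key, distinct))
    return result


data = [("b", 3), ("a", 2), ("a", 1), ("b", 3), ("a", 2)]
print(nested_order(data))
print(nested_distinct(data))
```

In this sketch neither function sorts or hashes inside a group, which is the saving the issue describes relative to SortedDataBag and DistinctDataBag.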