[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

Daniel Dai (JIRA) Fri, 30 Oct 2009 14:04:23 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772085#action_12772085
 ]


Daniel Dai commented on PIG-1038:
---------------------------------

Here is the design for this optimization:
1. Add SecondaryKeyOptimizer, which optimizes the map-reduce plan. It will:
1.1 Discover whether a nested foreach plan uses sort/distinct.
1.2 For the first such sort/distinct, use the sort/distinct key as the 
secondary key.
1.3 Once SecondaryKeyOptimizer discovers a secondary key, it will call 
POLocalRearrange.setSecondaryPlan, then drop the sort or simplify the distinct.

2. Change POLocalRearrange
2.1 Add setSecondaryPlan so that SecondaryKeyOptimizer can set the 
secondary plan.
2.2 Change constructLROutput to make a compound key, which is a tuple: (key, 
secondaryKey).
2.3 Duplicate the logic that strips the key from the values so that it also 
applies to the secondary key.

3. Change POPackageAnnotator to patch POPackage with the keyinfo from both the 
key and the secondaryKey.

4. Change POPackage to stitch the secondary key to the value.

5. Change MapReduceOper to indicate that the map-reduce operator needs a 
secondary key; JobControlCompiler will then set OutputValueGroupingComparator 
to use the mainKeyComparator.

6. Add mainKeyComparator, which inherits PigNullableWritable and compares only 
the main key. We need that for the OutputValueGroupingComparator.
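
As a rough illustration of steps 2 to 6, the following Python sketch simulates 
the shuffle outside Hadoop: the map side emits a compound (key, secondaryKey) 
tuple, records are sorted on the whole compound key, and grouping compares only 
the main key. The names local_rearrange and shuffle_and_package and the sample 
rows are made up for this sketch and are not Pig or Hadoop APIs; values are 
kept whole rather than stripped of their keys, to keep the example short.

```python
from itertools import groupby
from operator import itemgetter


def local_rearrange(record, key_col, secondary_col):
    """Emit ((key, secondaryKey), value), mimicking the compound key of step 2.2."""
    return (record[key_col], record[secondary_col]), record


def shuffle_and_package(records, key_col, secondary_col):
    """Sort on the compound key, then group on the main key only."""
    emitted = [local_rearrange(r, key_col, secondary_col) for r in records]
    # The simulated shuffle sorts on the full compound key:
    # main key first, then secondary key.
    emitted.sort(key=itemgetter(0))
    # The simulated grouping comparator looks at the main key only, so each
    # group holds all values of one main key, already ordered by secondary key.
    result = []
    for main_key, group in groupby(emitted, key=lambda kv: kv[0][0]):
        result.append((main_key, [value for _, value in group]))
    return result


# Hypothetical input rows: (group key, sort column).
rows = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]
grouped = shuffle_and_package(rows, key_col=0, secondary_col=1)
print(grouped)
```

Because the bag for each main key arrives already ordered, the nested "order" 
no longer has to sort anything on the reduce side.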

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>
> If a nested foreach plan contains sort/distinct, it is possible to use Hadoop 
> secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
> query. 
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct 
> D" to a special version of distinct, which does not do the sorting.
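
The "special version of distinct" in the second quoted example can skip sorting 
because, with a secondary sort key, equal values arrive next to each other. A 
minimal Python sketch of that idea follows; the name distinct_sorted and the 
sample values are hypothetical, and this is not the Pig implementation.

```python
def distinct_sorted(values):
    """Yield each value once, assuming equal values are adjacent in the input."""
    previous = object()  # sentinel that compares unequal to any real value
    for value in values:
        if value != previous:
            yield value
            previous = value


# Hypothetical values of A.$1 for one group, already ordered by the secondary sort.
sorted_values = [1, 1, 2, 3, 3, 3, 5]
unique_values = list(distinct_sorted(sorted_values))
print(unique_values)
```

Only the previous value is held in memory, so no sorted or hashed bag is needed 
for the de-duplication.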

