[ 
https://issues.apache.org/jira/browse/PIG-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244477#comment-14244477
 ] 

Cheolsoo Park commented on PIG-4066:
------------------------------------

+1.

I will commit this patch today. This optimization is disabled by default and 
only applicable to MR, so it shouldn't break anything. Nevertheless, I ran full 
unit tests and e2e tests, and both were clean.

[~hxquangnhat], we should document this. Do you mind opening another JIRA to 
add the documentation? I think 
[optimization-rules|http://pig.apache.org/docs/r0.13.0/perf.html#optimization-rules]
 is the best place to put it.

> An optimization for ROLLUP operation in Pig
> -------------------------------------------
>
>                 Key: PIG-4066
>                 URL: https://issues.apache.org/jira/browse/PIG-4066
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Quang-Nhat HOANG-XUAN
>            Assignee: Quang-Nhat HOANG-XUAN
>              Labels: hybrid-irg, optimization, rollup
>         Attachments: Current Rollup vs Our Rollup.jpg, PIG-4066.2.patch, 
> PIG-4066.3.patch, PIG-4066.4.patch, PIG-4066.5.patch, PIG-4066.patch, 
> TechnicalNotes.2.pdf, TechnicalNotes.pdf, UserGuide.pdf
>
>
> This patch aims to address the current limitation of the ROLLUP operator 
> in Pig: most of the work is done in the Map phase of the underlying MapReduce 
> job, which generates all possible intermediate keys that the reducers use to 
> aggregate and produce the ROLLUP output. Based on our previous work: 
> “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of 
> MapReduce ROLLUP aggregates” 
> (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we 
> show that the design space for a ROLLUP implementation allows for a different 
> approach (in-reducer grouping, IRG), in which less work is done in the Map 
> phase and the grouping is done in the Reduce phase. This patch presents the 
> most efficient implementation we designed (Hybrid IRG), which allows defining 
> a parameter to balance between parallelism (in the reducers) and 
> communication cost.
> This patch contains the following features:
> 1. The new ROLLUP approach: IRG, Hybrid IRG.
> 2. The PIVOT clause in CUBE operators.
> 3. Test cases.
> The new syntax to use our ROLLUP approach:
> alias = CUBE rel BY { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} [, { 
> CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]}...]
> If there are multiple ROLLUP operators in one CUBE clause, the last ROLLUP 
> operator will be executed with our approach (IRG, Hybrid IRG), while the 
> preceding ROLLUP operators will be executed with the default approach.
> We have already run some experiments comparing our ROLLUP 
> implementation with the current ROLLUP. More information can be found here: 
> http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
> The patch can be reviewed here: https://reviews.apache.org/r/23804/
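
To illustrate the trade-off the description draws between generating every 
rollup key in the Map phase and grouping in the Reduce phase (IRG), here is a 
minimal Python sketch. It is not the patch's code and does not use Pig or 
Hadoop: the function names, the in-memory "shuffle", and the sample records 
are all assumptions made for illustration, and the PIVOT parameter of Hybrid 
IRG is not modeled.

```python
from collections import defaultdict


def rollup_key(dims, level):
    """Keep the first `level` dimensions and replace the rest with None."""
    return dims[:level] + (None,) * (len(dims) - level)


def rollup_map_side(records):
    """Map-side style: every record is emitted once per rollup level,
    so the simulated shuffle carries (d + 1) pairs per record."""
    emitted = []
    for dims, value in records:
        for level in range(len(dims), -1, -1):
            emitted.append((rollup_key(dims, level), value))
    totals = defaultdict(int)
    for key, value in emitted:  # the "reducer" only sums per key
        totals[key] += value
    return dict(totals), len(emitted)


def rollup_in_reducer(records):
    """IRG style (simplified): every record is emitted once with its full
    key, and the "reducer" derives the coarser rollup levels itself."""
    emitted = [(dims, value) for dims, value in records]
    totals = defaultdict(int)
    for dims, value in emitted:
        for level in range(len(dims), -1, -1):
            totals[rollup_key(dims, level)] += value
    return dict(totals), len(emitted)


# Hypothetical (year, month, day) records with a count to aggregate.
records = [
    (("2014", "12", "11"), 5),
    (("2014", "12", "12"), 3),
    (("2013", "01", "01"), 2),
]

map_side_totals, map_side_shuffled = rollup_map_side(records)
irg_totals, irg_shuffled = rollup_in_reducer(records)
```

Both functions produce the same aggregates, but the map-side version shuffles 
four pairs per three-dimensional record while the in-reducer version shuffles 
one. The sketch ignores the cost the description mentions on the other side: 
moving the grouping into the reducers reduces the parallelism available there.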



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
