[ 
https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235214#comment-13235214
 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Hello everyone,

I am Prasanth Jayachandran, a graduate student at The Ohio State University. I 
am working with Prof. Arnab Nandi on providing CUBE operator support in Pig. I 
am submitting the initial version of the CUBE operator implementation (a naive 
version of cube materialization). As this is my first patch submission, I am 
really excited about it and hope to continue contributing to Apache Pig. 
Please review the attached patch and provide feedback for improving it.

The following explains the design decisions and presents some initial 
performance numbers (experiments performed on a single-node pseudo-distributed 
Hadoop setup).

*Pig syntax for Cubing*
CUBE rel BY (a,b,c);

*SQL/Oracle syntax for Cubing*
GROUP BY CUBE (a,b,c);

*CUBE operator internals*
The CUBE operator injects the logical plan for the following operators:
x = FOREACH rel GENERATE FLATTEN(CubeDimensions(a,b,c));
y = GROUP x by (a,b,c);

*What is the output schema of CUBE operator?*
{group: tuple(a,b,c), cube: bag{(dimensions::a,dimensions::b,dimensions::c)}}
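
To make the two injected operators and the resulting schema concrete, here is a 
minimal Python sketch of naive cube materialization. It assumes that 
CubeDimensions expands each input tuple into all 2^n combinations of its 
dimension values, with None standing in for "all" (the issue description 
mentions NULL being used this way). The function names cube_dimensions and cube 
are illustrative only and are not taken from the patch.

```python
from collections import defaultdict
from itertools import product


def cube_dimensions(dims):
    """Yield every combination of dims with each value kept or replaced by None.

    Assumption: this mirrors what FLATTEN(CubeDimensions(...)) emits per tuple.
    """
    for mask in product((True, False), repeat=len(dims)):
        yield tuple(value if keep else None for value, keep in zip(dims, mask))


def cube(rows, n_dims):
    """Group the expanded tuples by their (possibly None) dimension values.

    The first n_dims fields of each row are dimensions; any remaining fields
    are carried along into the bag unchanged.
    """
    groups = defaultdict(list)
    for row in rows:
        dims, rest = row[:n_dims], row[n_dims:]
        for key in cube_dimensions(dims):
            # The bag holds the expanded tuple, matching the "cube" bag schema.
            groups[key].append(key + rest)
    return dict(groups)


rows = [("x", "p", 1), ("x", "q", 2)]
result = cube(rows, 2)
for group_key, bag in result.items():
    print(group_key, bag)
```

Each key of result plays the role of the "group" tuple and each list the role of 
the "cube" bag. The expansion emits 2^n tuples per input row, which is why this 
approach is described as naive.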

*Why syntactically different from SQL/Oracle?*
- Easier to implement as it does not modify or break the existing GROUP BY 
operator implementation
- CUBE operator might require separate flags for the following
        - Switching between BUC and STAR cubing (future optimization)
        - HAVING clause for monotonic operations
        - ROLLUP/DRILLDOWN operations 
        - Hinting the location of a partially computed CUBE
        - User-specified inputs (example: the user can specify an algebraic 
attribute for converting a holistic measure to a partially algebraic measure)
- Some operations applicable in GROUP operator are not applicable for CUBE 
        - Constant expression evaluation 
        - Duplicate column projection
- Follows Pig language design principle of procedural simplicity

*Corner case handling*
Constant expressions can be provided in the GROUP BY operator. Constant 
expression support has been removed from the CUBE BY operator grammar. If 
constant expressions are used with CUBE BY, a FrontEndException is thrown.
Duplicate column projection is supported in GROUP BY. Duplicate columns will 
be eliminated while generating the logical plan. The current implementation 
ignores duplicate dimensions in CUBE BY. This can also be changed to throw an 
exception if the user repeats a dimension.
If the cube dimensions are a subset of the columns in the input schema, the 
remaining columns in the input schema are pushed into the “cube” bag.
For example: 
inp = LOAD '/pig/data/input' AS (a,b,c,d);
x = CUBE inp BY (a,b);
The schema of x will be {group: tuple(a,b), 
cube: bag{(dimensions::a,dimensions::b,c,d)}}
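
The corner cases above can be sketched as a small validation step. This is a 
hypothetical Python model, not the code in the patch: resolve_cube_dimensions 
is an invented name, ValueError stands in for FrontEndException, and a constant 
expression is approximated as anything that is not a column name in the input 
schema.

```python
def resolve_cube_dimensions(schema, requested):
    """Return (dimensions, carried_columns) for a CUBE ... BY clause.

    schema: column names of the input relation, in order.
    requested: names given in the BY clause, possibly with repeats.
    """
    dims = []
    for name in requested:
        if name not in schema:
            # Stand-in for the FrontEndException raised on constant expressions.
            raise ValueError(f"CUBE BY expects column names, got {name!r}")
        if name not in dims:
            # A repeated dimension is silently ignored, as described above.
            dims.append(name)
    # Columns that are not dimensions are carried into the "cube" bag.
    carried = [column for column in schema if column not in dims]
    return dims, carried


dims, carried = resolve_cube_dimensions(["a", "b", "c", "d"], ["a", "b", "a"])
print(dims, carried)
```

Throwing on a repeated dimension instead of ignoring it would only require 
replacing the membership check with a raise.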

*Performance*
*Apache Pig Test Environment*
OS: Ubuntu 11.04 running as a guest OS in VirtualBox
CPU Cores: 2 (4 Threads)
Memory: 8GB
HDD: 100GB 
Mode: Single-node pseudo-distributed setup running Hadoop 0.20.2
Configuration: Default Hadoop configuration

*SQL Server 2008 R2 Test Environment*
OS: Windows 7
CPU Cores: 4 (8 Threads)
Memory: 16GB
HDD: 500GB
!Pig-Cubing-Performance.png!

*Acknowledgements*
Professor Arnab Nandi, Department of Computer Science and Engineering, The Ohio 
State University, Columbus, for guidance and assistance throughout the course 
of this initial implementation.
Chaitanya Solarpurikar, Graduate Student, Department of Computer Science and 
Engineering, The Ohio State University, Columbus, for setting up the SQL Server 
test environment and running the performance comparison experiments.
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>              Labels: gsoc2012
>         Attachments: Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation 
> in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- 
> which in addition to all dim1-2-3, produces aggregations for just dim1, just 
> dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient 
> cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, 
> and work up from there.
> This is a candidate project for Google summer of code 2012. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012


