[ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856581#comment-13856581 ]
Prasanth J commented on PIG-2167:
---------------------------------

[~zhchbin] Thanks for your interest. The cube work in Pig has several parts. PIG-2167 adds the syntax for the CUBE and ROLLUP operations and rewrites the logical plan to insert the cubing and rollup UDFs. In other words, PIG-2167 is just syntactic sugar over the UDFs.

The main idea of the MRCube algorithm is to handle holistic measures (PIG-2831). The PIG-2831 patch is badly outdated and has not been reviewed yet, so there may be many rough edges. PIG-2831 does the following:

1) Identifies holistic measures and inserts logical cube/rollup operators with some additional information
2) Identifies algebraic attributes (partitioning attribute)
3) Inserts a sampling job into the MRPlan (adds a new sampling algorithm that reads N records randomly without reading the entire file)
4) Adds new intermediate storage formats that can write/read statistics
5) Inserts a post-processing job to aggregate the partitioned results

All of the above steps are required for the value-partitioning algorithm for data distribution. The patch does not implement batch-area identification (distributing the computation) as described in the paper. It might be worthwhile to start with value partitioning, followed by batch areas and other optimizations such as BUC.

I would be more than happy to help you with any of these steps in my free time.

> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Prasanth J
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch,
> PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation
> in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" --
> which in addition to all dim1-2-3, produces aggregations for just dim1, just
> dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient
> cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures,
> and work up from there.
> This is a candidate project for Google summer of code 2012. More information
> about the program can be found at
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
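
For readers unfamiliar with cubing, the naive approach mentioned in the issue description can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Pig's implementation and not the MRCube algorithm: the function names (`cube_groups`, `naive_cube_sum`) and the sample rows are hypothetical, SUM stands in for an algebraic measure, and `None` plays the role that NULL plays in the description (meaning "all").

```python
from collections import defaultdict
from itertools import combinations


def cube_groups(num_dims):
    """Yield every subset of dimension positions (2^n grouping sets)."""
    for size in range(num_dims, -1, -1):
        for kept in combinations(range(num_dims), size):
            yield kept


def naive_cube_sum(rows, dims, measure):
    """Sum `measure` for every combination of `dims`.

    A dimension that is rolled up is replaced by None ("all").
    Hypothetical helper for illustration only.
    """
    totals = defaultdict(int)
    for row in rows:
        for kept in cube_groups(len(dims)):
            key = tuple(
                row[dim] if pos in kept else None
                for pos, dim in enumerate(dims)
            )
            totals[key] += row[measure]
    return dict(totals)


# Made-up sample data, not taken from the issue.
rows = [
    {"region": "east", "product": "apple", "sales": 10},
    {"region": "east", "product": "pear", "sales": 5},
    {"region": "west", "product": "apple", "sales": 7},
]

result = naive_cube_sum(rows, ["region", "product"], "sales")
for key in sorted(result, key=repr):
    print(key, result[key])
```

Each input row contributes to 2^n keys, which is why this only scales for small n and why partial sums can be merged freely for an algebraic measure such as SUM. A holistic measure such as a median cannot be merged from partial results this way, which is the case the comment above says PIG-2831 targets. Note also that this sketch cannot distinguish a real `None` value in the data from the "all" marker.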