[jira] [Commented] (PIG-2765) Implementing RollupDimensions UDF and adding ROLLUP clause in CUBE operator

Prasanth J (JIRA) Wed, 01 Aug 2012 17:10:24 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427021#comment-13427021
 ]


Prasanth J commented on PIG-2765:
---------------------------------


RELEASE NOTES

*Introduction*
*Data cube*
In Online Analytical Processing Systems (OLAP), a data cube is a way of 
organizing and visualizing multi-dimensional data to get useful insights over 
some measure of interest. Measure is a term for numerical facts that are 
analyzed against various dimensions of data. Based on the type of aggregations 
performed, measures can be algebraic (like SUM, COUNT, MAX etc.) or holistic 
(TOP-K, MEDIAN etc.). The following articles on wikipedia provides good 
illustrations for data cubes
http://en.wikipedia.org/wiki/Online_analytical_processing
http://en.wikipedia.org/wiki/OLAP_cube

*Cube operation*
For a specified set of dimensions, cube operations computes aggregates for all 
possbile combinations of group by dimensions. For example, CUBE(product, 
location) will generate {(product,location), (product,null), (null,location), 
(null,null)} combinations for all tuples in product and location dimensions, 
where (null, null) represents the grand total. 
The number of group by combinations generated by cube for n dimensions will be 
2^n. 

*Rollup operation*
For a specified set of dimensions, rollup operations computes multiple levels 
of aggregates based on hierarchical ordering. For example, ROLLUP(region,state, 
city) will generate {(region, state, city), (region, state, null), (region, 
null, null), (null, null, null)} combinations for all tuples in region, state 
and city dimensions. It computes aggregates from the most-details level 
(region, state, city) to the grand total (null, null, null). Rollup is useful 
when there is hierarchical ordering on the dimensions. 
The number of group by combinations generated by rollup for n dimensions will 
be n+1.

*Cubing in Pig*
Pig supports CUBE operator which can be used to perform cube or rollup 
operations. The patches (PIG-2710, PIG-2765) adds syntactic sugar to the 
existing builtin CubeDimensions and RollupDimensions UDF. With this new 
addition, aggregations across multiple dimensions can be easily represented 
using CUBE operator. Following example illustrates the CUBE operator usage:

*Syntax*
{code}outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }, [ 
CUBE expression | ROLLUP expression ] [PARALLEL n];{code}

outalias - The name of the output relation
inalias - The name of the input relation
CUBE, BY, ROLLUP - keywords
expression - projections (dimensions) of the relation. 
PARALLEL n - Increase the parallelism of a job by specifying the number of 
reduce tasks, n. The default value for n is 1 (one reduce task).

*Basic usage of CUBE operation*
{code}salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as 
(product:chararray, year:int, region:chararray, state:chararray, 
city:chararray, sales:long); 
cubedinp = CUBE salesinp BY CUBE(product,year);
result = FOREACH cubedinp GENERATE FLATTEN(group), SUM(cube.sales) as 
totalsales;
{code}

*Sample output*
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above 
query with cube operation will output
(car,2012,4000)
(car,,4000)
(,2012,4000)
(,,4000)

*Output schema*
{code}grunt> describe cubedinp;
cubedinp: {group: (product: chararray,year: int),cube: {(product: 
chararray,year: int,region: chararray,state: chararray,city: chararray,sales: 
long)}}
{code}

Note the second column, ‘cube’ field which is a bag of all tuples that belong 
to ‘group’. Also note that the measure attribute ‘sales’ along with other 
unused dimensions in load statement are pushed down so that it can be 
referenced later while computing aggregates on the measure, like in this case 
SUM(cube.sales). 

*Basic usage of ROLLUP operation*
{code}salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as 
(product:chararray, year:int, region:chararray, state:chararray, 
city:chararray, sales:long); 
rolledup = CUBE salesinp BY ROLLUP(region,state,city);
result = FOREACH rolledup GENERATE FLATTEN(group), SUM(cube.sales) as 
totalsales;
{code}

*Sample output*
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above 
query with rollup operation will output
(midwest,ohio,columbus,4000)
(midwest,ohio,,4000)
(midwest,,,4000)
(,,,4000)

*Output schema*
{code}grunt> describe rolledup;
rolledup: {group: (region: chararray,state: chararray,city: chararray),cube: 
{(region: chararray,state: chararray,city: chararray,product: chararray,year: 
int,sales: long)}}
{code}

*Basic usage of CUBE and ROLLUP operation combined*
If CUBE and ROLLUP operations are used together, the output groups will be the 
cross product of all groups generated by cube and rollup operation. If there 
are m dimensions in cube operations and n dimensions in rollup operation then 
overall number of combinations will be (2^m) * (n+1). 
{code}salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as 
(product:chararray, year:int, region:chararray, state:chararray, 
city:chararray, sales:long); 
cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, 
city);
result = FOREACH cubed_and_rolled GENERATE FLATTEN(group), SUM(cube.sales) as 
totalsales;
{code}

*Sample output*
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above 
query with cube and rollup operation will output
(car,2012,midwest,ohio,columbus,4000)
(car,2012,midwest,ohio,,4000)
(car,2012,midwest,,,4000)
(car,2012,,,,4000)
(car,,midwest,ohio,columbus,4000)
(car,,midwest,ohio,,4000)
(car,,midwest,,,4000)
(car,,,,,4000)
(,2012,midwest,ohio,columbus,4000)
(,2012,midwest,ohio,,4000)
(,2012,midwest,,,4000)
(,2012,,,,4000)
(,,midwest,ohio,columbus,4000)
(,,midwest,ohio,,4000)
(,,midwest,,,4000)
(,,,,,4000)

*Output schema*
{code}grunt> describe cubed_and_rolled;
cubed_and_rolled: {group: (product: chararray,year: int,region: 
chararray,state: chararray,city: chararray),cube: {(product: chararray,year: 
int,region: chararray,state: chararray,city: chararray,sales: long)}}
{code}

*Handling null values in dimensions*
Since null values are used to represent subtotals in cube and rollup operation, 
in order to differentiate the legitimate null values that already exists as 
dimension values, CUBE operator converts any null values in dimensions to 
"unknown" value before performing cube or rollup operation. For example, for 
CUBE(product,location) with a sample tuple (car,null) the output will be 
{(car,unknown), (car,null), (null,unknown), (null,null)}. 

                
> Implementing RollupDimensions UDF and adding ROLLUP clause in CUBE operator
> ---------------------------------------------------------------------------
>
>                 Key: PIG-2765
>                 URL: https://issues.apache.org/jira/browse/PIG-2765
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>             Fix For: 0.11
>
>         Attachments: PIG-2765.1.patch, PIG-2765.2.git.patch, PIG-2765.2.patch
>
>
> Implement RollupDimensions UDF which performs aggregation from most detailed 
> level of dimensions to the most general level (grand total) in hierarchical 
> order. Provide support for ROLLUP clause in CUBE operator. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2765) Implementing RollupDimensions UDF and adding ROLLUP clause in CUBE operator

Reply via email to