[
https://issues.apache.org/jira/browse/PIG-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427021#comment-13427021
]
Prasanth J commented on PIG-2765:
---------------------------------
RELEASE NOTES
*Introduction*
*Data cube*
In Online Analytical Processing Systems (OLAP), a data cube is a way of
organizing and visualizing multi-dimensional data to get useful insights over
some measure of interest. Measure is a term for numerical facts that are
analyzed against various dimensions of data. Based on the type of aggregations
performed, measures can be algebraic (like SUM, COUNT, MAX etc.) or holistic
(TOP-K, MEDIAN etc.). The following articles on wikipedia provides good
illustrations for data cubes
http://en.wikipedia.org/wiki/Online_analytical_processing
http://en.wikipedia.org/wiki/OLAP_cube
*Cube operation*
For a specified set of dimensions, cube operations computes aggregates for all
possbile combinations of group by dimensions. For example, CUBE(product,
location) will generate {(product,location), (product,null), (null,location),
(null,null)} combinations for all tuples in product and location dimensions,
where (null, null) represents the grand total.
The number of group by combinations generated by cube for n dimensions will be
2^n.
*Rollup operation*
For a specified set of dimensions, rollup operations computes multiple levels
of aggregates based on hierarchical ordering. For example, ROLLUP(region,state,
city) will generate {(region, state, city), (region, state, null), (region,
null, null), (null, null, null)} combinations for all tuples in region, state
and city dimensions. It computes aggregates from the most-details level
(region, state, city) to the grand total (null, null, null). Rollup is useful
when there is hierarchical ordering on the dimensions.
The number of group by combinations generated by rollup for n dimensions will
be n+1.
*Cubing in Pig*
Pig supports CUBE operator which can be used to perform cube or rollup
operations. The patches (PIG-2710, PIG-2765) adds syntactic sugar to the
existing builtin CubeDimensions and RollupDimensions UDF. With this new
addition, aggregations across multiple dimensions can be easily represented
using CUBE operator. Following example illustrates the CUBE operator usage:
*Syntax*
{code}outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }, [
CUBE expression | ROLLUP expression ] [PARALLEL n];{code}
outalias - The name of the output relation
inalias - The name of the input relation
CUBE, BY, ROLLUP - keywords
expression - projections (dimensions) of the relation.
PARALLEL n - Increase the parallelism of a job by specifying the number of
reduce tasks, n. The default value for n is 1 (one reduce task).
*Basic usage of CUBE operation*
{code}salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as
(product:chararray, year:int, region:chararray, state:chararray,
city:chararray, sales:long);
cubedinp = CUBE salesinp BY CUBE(product,year);
result = FOREACH cubedinp GENERATE FLATTEN(group), SUM(cube.sales) as
totalsales;
{code}
*Sample output*
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above
query with cube operation will output
(car,2012,4000)
(car,,4000)
(,2012,4000)
(,,4000)
*Output schema*
{code}grunt> describe cubedinp;
cubedinp: {group: (product: chararray,year: int),cube: {(product:
chararray,year: int,region: chararray,state: chararray,city: chararray,sales:
long)}}
{code}
Note the second column, ‘cube’ field which is a bag of all tuples that belong
to ‘group’. Also note that the measure attribute ‘sales’ along with other
unused dimensions in load statement are pushed down so that it can be
referenced later while computing aggregates on the measure, like in this case
SUM(cube.sales).
*Basic usage of ROLLUP operation*
{code}salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as
(product:chararray, year:int, region:chararray, state:chararray,
city:chararray, sales:long);
rolledup = CUBE salesinp BY ROLLUP(region,state,city);
result = FOREACH rolledup GENERATE FLATTEN(group), SUM(cube.sales) as
totalsales;
{code}
*Sample output*
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above
query with rollup operation will output
(midwest,ohio,columbus,4000)
(midwest,ohio,,4000)
(midwest,,,4000)
(,,,4000)
*Output schema*
{code}grunt> describe rolledup;
rolledup: {group: (region: chararray,state: chararray,city: chararray),cube:
{(region: chararray,state: chararray,city: chararray,product: chararray,year:
int,sales: long)}}
{code}
*Basic usage of CUBE and ROLLUP operation combined*
If CUBE and ROLLUP operations are used together, the output groups will be the
cross product of all groups generated by cube and rollup operation. If there
are m dimensions in cube operations and n dimensions in rollup operation then
overall number of combinations will be (2^m) * (n+1).
{code}salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as
(product:chararray, year:int, region:chararray, state:chararray,
city:chararray, sales:long);
cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state,
city);
result = FOREACH cubed_and_rolled GENERATE FLATTEN(group), SUM(cube.sales) as
totalsales;
{code}
*Sample output*
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above
query with cube and rollup operation will output
(car,2012,midwest,ohio,columbus,4000)
(car,2012,midwest,ohio,,4000)
(car,2012,midwest,,,4000)
(car,2012,,,,4000)
(car,,midwest,ohio,columbus,4000)
(car,,midwest,ohio,,4000)
(car,,midwest,,,4000)
(car,,,,,4000)
(,2012,midwest,ohio,columbus,4000)
(,2012,midwest,ohio,,4000)
(,2012,midwest,,,4000)
(,2012,,,,4000)
(,,midwest,ohio,columbus,4000)
(,,midwest,ohio,,4000)
(,,midwest,,,4000)
(,,,,,4000)
*Output schema*
{code}grunt> describe cubed_and_rolled;
cubed_and_rolled: {group: (product: chararray,year: int,region:
chararray,state: chararray,city: chararray),cube: {(product: chararray,year:
int,region: chararray,state: chararray,city: chararray,sales: long)}}
{code}
*Handling null values in dimensions*
Since null values are used to represent subtotals in cube and rollup operation,
in order to differentiate the legitimate null values that already exists as
dimension values, CUBE operator converts any null values in dimensions to
"unknown" value before performing cube or rollup operation. For example, for
CUBE(product,location) with a sample tuple (car,null) the output will be
{(car,unknown), (car,null), (null,unknown), (null,null)}.
> Implementing RollupDimensions UDF and adding ROLLUP clause in CUBE operator
> ---------------------------------------------------------------------------
>
> Key: PIG-2765
> URL: https://issues.apache.org/jira/browse/PIG-2765
> Project: Pig
> Issue Type: Sub-task
> Reporter: Prasanth J
> Assignee: Prasanth J
> Fix For: 0.11
>
> Attachments: PIG-2765.1.patch, PIG-2765.2.git.patch, PIG-2765.2.patch
>
>
> Implement RollupDimensions UDF which performs aggregation from most detailed
> level of dimensions to the most general level (grand total) in hierarchical
> order. Provide support for ROLLUP clause in CUBE operator.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira