Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by PiSong:
http://wiki.apache.org/pig/PigOLAPSketch

New page:
= Pig OLAP Support =

== Background ==
Online analytical data processing is a powerful method for business data 
analysis. However, the preprocessing of full OLAP cubes involves 
multiplications and aggregations of large volume of data, thus being 
computationally expensive in nature. By allowing the process to be executed on 
grid data processing framework such as MapReduce, this data analysis technique 
can be more accessible for most business users.

== Goals ==
 * We will allow population of full OLAP cubes
 * We will allow tabular projection operations (1D - as Bag, 2D - as crosstab 
Bag)

The projection operator will be provided just for experimenting with data at 
this stage. Pig currently doesn't have indexing infrastructure support so it 
can be too slow in practice. My primary aim is to use Pig for cube processing 
then export to external system/frontend for browsing.

== Implementation ==
Cube generation algorithm will be based on paper from Microsoft Research 
http://paul.rutgers.edu/~aminabdu/cs541/cube_op.pdf but converted to MapReduce 
version.
We only work on flattened star, snowflake schema as generally do.

==== Map Phase ====

Map:( k1 , v1 ) -> list( k2 , v2 )

{{{
Map Input: ((x1,x2,x3, ..., xn), (measure1, measure2))

Map Output: 

Cn,1
((MAGICKEY,x2,x3, ..., xn), (measure1, measure2)1)
((x1,MAGICKEY,x3, ..., xn), (measure1, measure2)1)
...
((x1, x2, x3, ..., MAGICKEY), (measure1, measure2)1)


Cn,2
((MAGICKEY,MAGICKEY,x3, ..., xn-1, xn), (measure1, measure2)1)
((MAGICKEY, x2, MAGICKEY, ..., xn-1, xn), (measure1, measure2)1)
...
((x1, x2, x3, ...,MAGICKEY, MAGICKEY), (measure1, measure2)1)

...

Cn,n-1
}}}
There will be 2^n-1 output records from an input

where n = number of dimensions

==== Reduce Phase ====

Reduce: ( k2 , list ( v2 )) -> list( v2 )

Basically just apply aggregate function(s) on all the measures. The logical 
view of output (and obviously also a point in hypercube) will look like:-

|| ||aggregate1||aggregate2||aggregate3||
||measure1||output1||output2||output3||
||measure2||output4||output5||output6||
||measure3||output7||output8||output9||


== Language Design ==

We will allow the preprocessing of full OLAP cube. This will result in a new 
type called "Cube"

Analytical cube creation syntax

{{{
Name = CUBE Name
       Dimension (ColumnName (,ColumnName)* )
       MEASURE (ColumnName (,ColumnName)* )
       AGGREGATE (FuncName (,FuncName)* )
       [ MAGICKEY (Number) ]?
}}}

Neglecting "MAGICKEY" clause will result in automatically picked MAGICKEY (will 
be hardcoded)

Sample:- 
{{{
Y = CUBE X
    DIMENSION (D1, D2, D3)
    MEASURE   (M1,M2)
    AGGREGATE (SUM, AVG, UDF1)
    MAGICKEY (2.718281828459045) ;
}}}

=== Cube browsing syntax ===

Only 1D and 2D can be displayed and understood easily by normal users so we are 
only interested in 1D projection, and 2D pivoting of hypercubes.

==== Normal Projection ====
We model 1D cube browsing as a process of traversing a 
dimension/dimension-member tree

For example, 

D1 = {Sydney, Melbourne, Brisbane}

D2 = { Pi, Jon, Lap }

D3 = { Jan, Feb }


{{{
             Root
           /   |  \
         D1   D2   D3       *
       /  | \
     Syd Mel Bris
     / \
    D2  D3                  *
   / | \ 
 Pi Jon Lap
     |
     D1                     *
    /  \
   Jan Feb
}}}
''Note'': Only levels with "*" can be viewed

/D1 would give:-
||Syd      ||1000||
||Mel      ||1200||
||Bris     ||1500||

/D1/Syd/D2 would give:-
(Under D1=Syd)
||Pi   || 100||
||Jon  || 200||
||Lap  || 300||



So, the syntax might look like:-

{{{
Z = CUBEVIEW X AT /D1/Syd/D2 ;
}}}

where X is a cube

This command will result in a bag containing the aggregated data

==== Pivoting ====
Basically we just want to be able to define two dimensions for pivoting. 
Dimensions appeared up in the dimension tree are not available in the lower 
level.

{{{
Z = CUBEVIEW X AT /D1/Syd/(D2,D3) ;
}}}

Output will look like this (in Bag):-

(Under D1=Syd)
||||     Jan||  Feb||
||Pi||    50||  50||
||Jon||   150|| 50||
||Lap||   100|| 200||


== Limitations ==
Initially, there will be no support for:-
 * Non-discrete dimension support. This will require histogram generation 
support in Pig.
 * Hierarchical dimension. Hierarchical dimension that forms a lattice.
 * No indexing support on cube browsing

Reply via email to