[ 
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435770#comment-13435770
 ] 

Prasanth J commented on PIG-2831:
---------------------------------

This is an initial patch of MR-Cube implementation for holistic measures. As of 
now this supports only COUNT + DISTINCT holistic measure. 
In overall, the way this patch works is when a CUBE operator is encountered in 
the logical plan the plan is modified to insert LOCube operator. This operator 
is attached with information about the clause of operation (CUBE/ROLLUP) and 
the corresponding dimensions. When the logical plan is compiled to physical 
plan, the determination of holistic measure happens. If holistic measure is 
detected then the information about the holistic measure and algebraic 
attribute is stored in POCube operator. Main changes happens when physical plan 
is compiled to MRPlan in MRCompiler. Following are the changes to the MRPlan
1) POCube visitor determines the sample size based on the estimated number of 
rows. A sample operator is inserted into the plan to get a sample dataset, upon 
which naive cubing is performed. The output of this MRJob is an annotated cube 
lattice with partition factors for each region.
2) Second MRJob receives the annotated lattice through distributed cache and 
will perform the actual full cube materialization. UDFs for partitioning large 
groups and performing post processing of output tuples are attached to this 
plan. 
3) Final MRJob is post aggregation job in which the output (measure values) of 
groups that are spread across multiple reducers are aggregated together to get 
a final result.

There are some issues/improvements that I have mentioned in the patch as FIXME 
and TODOs that require suggestions. 
One issue which I would like to point out here is about finding the actual 
tuple size (NOT in-memory size).
For finding the actual tuple size, I wrote a ReadSingleLoader which reads one 
tuple from the dataset and gets the number of bytes read from PigStorage. Using 
the number of bytes read and overall input data size I am finding the 
approximate number of rows in the dataset. One issue with this approach is that 
if the schema contains a variable datatype like chararray or bytearray then the 
estimation will have high error rate. To get a more accurate estimate of number 
of rows for variable datatypes we need to know the underlying distribution of 
the column and load n number of tuples based on the distribution. It will be 
helpful if someone can share thoughts about a better way of finding/estimating 
the number of rows.

Review request for this patch: https://reviews.apache.org/r/6651/
                
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch
>
>
> Implementing distributed cube materialization on holistic measure based on 
> MR-Cube approach as described in http://arnab.org/files/mrcube.pdf. 
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine algebraic attribute (can be detected automatically for few 
> cases, if automatic detection fails user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes naive cube algorithm 
> and generates annotated cube lattice (contains large group partitioning 
> information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using 
> distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of 
> actual cube materialization job
> 7) OOM exception handling

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to