[ https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435770#comment-13435770 ]
Prasanth J commented on PIG-2831: --------------------------------- This is an initial patch of MR-Cube implementation for holistic measures. As of now this supports only COUNT + DISTINCT holistic measure. In overall, the way this patch works is when a CUBE operator is encountered in the logical plan the plan is modified to insert LOCube operator. This operator is attached with information about the clause of operation (CUBE/ROLLUP) and the corresponding dimensions. When the logical plan is compiled to physical plan, the determination of holistic measure happens. If holistic measure is detected then the information about the holistic measure and algebraic attribute is stored in POCube operator. Main changes happens when physical plan is compiled to MRPlan in MRCompiler. Following are the changes to the MRPlan 1) POCube visitor determines the sample size based on the estimated number of rows. A sample operator is inserted into the plan to get a sample dataset, upon which naive cubing is performed. The output of this MRJob is an annotated cube lattice with partition factors for each region. 2) Second MRJob receives the annotated lattice through distributed cache and will perform the actual full cube materialization. UDFs for partitioning large groups and performing post processing of output tuples are attached to this plan. 3) Final MRJob is post aggregation job in which the output (measure values) of groups that are spread across multiple reducers are aggregated together to get a final result. There are some issues/improvements that I have mentioned in the patch as FIXME and TODOs that require suggestions. One issue which I would like to point out here is about finding the actual tuple size (NOT in-memory size). For finding the actual tuple size, I wrote a ReadSingleLoader which reads one tuple from the dataset and gets the number of bytes read from PigStorage. Using the number of bytes read and overall input data size I am finding the approximate number of rows in the dataset. One issue with this approach is that if the schema contains a variable datatype like chararray or bytearray then the estimation will have high error rate. To get a more accurate estimate of number of rows for variable datatypes we need to know the underlying distribution of the column and load n number of tuples based on the distribution. It will be helpful if someone can share thoughts about a better way of finding/estimating the number of rows. Review request for this patch: https://reviews.apache.org/r/6651/ > MR-Cube implementation (Distributed cubing for holistic measures) > ----------------------------------------------------------------- > > Key: PIG-2831 > URL: https://issues.apache.org/jira/browse/PIG-2831 > Project: Pig > Issue Type: Sub-task > Reporter: Prasanth J > Assignee: Prasanth J > Attachments: PIG-2831.1.git.patch > > > Implementing distributed cube materialization on holistic measure based on > MR-Cube approach as described in http://arnab.org/files/mrcube.pdf. > Primary steps involved: > 1) Identify if the measure is holistic or not > 2) Determine algebraic attribute (can be detected automatically for few > cases, if automatic detection fails user should hint the algebraic attribute) > 3) Modify MRPlan to insert a sampling job which executes naive cube algorithm > and generates annotated cube lattice (contains large group partitioning > information) > 4) Modify plan to distribute annotated cube lattice to all mappers using > distributed cache > 5) Execute actual cube materialization on full dataset > 6) Modify MRPlan to insert a post process job for combining the results of > actual cube materialization job > 7) OOM exception handling -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira