[ 
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447460#comment-13447460
 ] 

Prasanth J commented on PIG-2831:
---------------------------------

Hi Dmitriy,

I have implemented the new inter storage with statistics gathering and a new 
sample loader, as per your idea on RB. Attached is the new patch containing the 
following changes:
1) Added a new RichInterStorage, which implements the StoreMetadata and 
LoadMetadata interfaces for storing and loading the statistics of intermediate 
data. RichInterStorage uses RichRecordReader and RichInputFormat for reading 
intermediate data, and RichRecordWriter and RichOutputFormat for storing 
intermediate data. RichRecordWriter and RichOutputFormat are the same as 
InterRecordWriter and InterOutputFormat. The main difference is in 
RichRecordReader and RichInputFormat. RichInputFormat wraps all the splits 
into one logical split so that only one mapper is used for loading the sample 
dataset. 
2) CubeSampleLoader uses the underlying RichRecordReader to get random samples 
of data. RichRecordReader opens at most 100 inner splits and picks a random 
split when reading each tuple. 
3) Changes to PigOutputCommitter for storing statistics. Statistics are stored 
at the end of every commitTask(), once for each output partition. 
RichInterStorage takes care of loading the statistics for all the partitions 
and aggregating them. The statistics store numberOfRows and avgInMemTupleSize 
for each partition (only these two values are required for holistic cubing).
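
To make the aggregation in 3) concrete, here is a minimal sketch of how 
per-partition statistics could be combined. It is not the code from the patch: 
the class name PartitionStats and the method aggregate() are hypothetical, and 
it assumes that avgInMemTupleSize is combined as an average weighted by row 
count. Only the two field names come from the description above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for the two per-partition values named above.
// This class does not exist in Pig; it only illustrates the idea.
final class PartitionStats {
    final long numberOfRows;
    final double avgInMemTupleSize; // assumed to be in bytes

    PartitionStats(long numberOfRows, double avgInMemTupleSize) {
        this.numberOfRows = numberOfRows;
        this.avgInMemTupleSize = avgInMemTupleSize;
    }

    // Combine statistics from several partitions. Row counts are summed.
    // The average tuple size is weighted by each partition's row count
    // (an assumption), so large partitions count proportionally more.
    static PartitionStats aggregate(List<PartitionStats> parts) {
        long totalRows = 0;
        double totalBytes = 0.0;
        for (PartitionStats p : parts) {
            totalRows += p.numberOfRows;
            totalBytes += p.numberOfRows * p.avgInMemTupleSize;
        }
        // Guard against division by zero when every partition is empty.
        double avg = totalRows == 0 ? 0.0 : totalBytes / totalRows;
        return new PartitionStats(totalRows, avg);
    }

    public static void main(String[] args) {
        PartitionStats total = aggregate(Arrays.asList(
                new PartitionStats(100, 40.0),
                new PartitionStats(300, 80.0)));
        System.out.println(total.numberOfRows + " rows, avg "
                + total.avgInMemTupleSize + " bytes");
    }
}
```

With partitions of 100 rows at 40 bytes and 300 rows at 80 bytes, this gives 
400 rows and an average of 70 bytes per tuple. A plain unweighted mean would 
give 60 and would underestimate the in-memory size of the data.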

This patch is quite large, mainly because most of the changes (at the logical 
layer) come from an old formatting issue that I fixed in this patch. Sorry 
about that. 

I have also updated the patch in RB. Please review it and let me know your 
feedback. I have also left open some of the issues from your earlier review 
comments that need your input. 

                
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch, 
> PIG-2831.3.git.patch, PIG-2831.4.git.patch, PIG-2831.5.git.patch
>
>
> Implementing distributed cube materialization on holistic measure based on 
> MR-Cube approach as described in http://arnab.org/files/mrcube.pdf. 
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine algebraic attribute (can be detected automatically in a few 
> cases; if automatic detection fails, the user should hint the algebraic 
> attribute)
> 3) Modify MRPlan to insert a sampling job which executes naive cube algorithm 
> and generates annotated cube lattice (contains large group partitioning 
> information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using 
> distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of 
> actual cube materialization job
> 7) OOM exception handling

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
