[ 
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451663#comment-13451663
 ] 

Prasanth J commented on PIG-2831:
---------------------------------

Sorry, I should have posted this earlier. 
Below are the memory statistics for the actual full cube job, comparing the new 
proxy bag with the older approach. Both were run on a 10-node cluster using CDH4 
(with default settings).

*Older approach:*
Spilled Records - 130,616,769
Physical memory (bytes) snapshot - 7,162,691,584
Virtual memory (bytes) snapshot - 27,694,501,888
Total committed heap usage (bytes) - 4,517,134,336

*Proxy bag approach:*
Spilled Records - 130,616,769
Physical memory (bytes) snapshot - 6,429,990,912
Virtual memory (bytes) snapshot - 27,698,757,632
Total committed heap usage (bytes) - 3,681,222,656 *(~19% improvement)*

I tested it over 3 runs and we get an approximately 19% improvement in committed 
heap usage. :) 
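
For anyone who wants to check the percentages, here is a minimal sketch of the 
arithmetic using the counter values quoted above. The class and method names 
are made up for illustration and are not part of the patch.

```java
public class HeapImprovement {
    /** Percentage reduction when going from {@code before} to {@code after}. */
    static double percentReduction(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Counter values copied from the two runs reported above.
        long heapOld = 4_517_134_336L, heapNew = 3_681_222_656L;
        long physOld = 7_162_691_584L, physNew = 6_429_990_912L;

        System.out.printf("committed heap reduction: %.1f%%%n",
                percentReduction(heapOld, heapNew));
        System.out.printf("physical memory reduction: %.1f%%%n",
                percentReduction(physOld, physNew));
    }
}
```

The committed heap reduction works out to roughly 18.5% (the ~19% quoted above) 
and the physical memory reduction to roughly 10.2%. Spilled records are identical 
in both runs, and virtual memory is essentially unchanged.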

Damn, I totally forgot about the Guava transform functions!! Will update the 
patch. Thanks, Dmitriy, for your quick code review :)
                
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch, 
> PIG-2831.3.git.patch, PIG-2831.4.git.patch, PIG-2831.5.git.patch, 
> PIG-2831.6.git.patch, PIG-2831.7.git.patch
>
>
> Implement distributed cube materialization for holistic measures, based on 
> the MR-Cube approach described in http://arnab.org/files/mrcube.pdf. 
> Primary steps involved:
> 1) Identify whether the measure is holistic
> 2) Determine the algebraic attribute (it can be detected automatically in a 
> few cases; if automatic detection fails, the user should hint the algebraic 
> attribute)
> 3) Modify the MRPlan to insert a sampling job that executes the naive cube 
> algorithm and generates an annotated cube lattice (containing large-group 
> partitioning information)
> 4) Modify the plan to distribute the annotated cube lattice to all mappers 
> using the distributed cache
> 5) Execute the actual cube materialization on the full dataset
> 6) Modify the MRPlan to insert a post-processing job that combines the 
> results of the actual cube materialization job
> 7) OOM exception handling
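
As a rough illustration of the "cube lattice" mentioned in step 3, the sketch 
below enumerates every group-by region for a set of dimensions, with "*" 
marking a rolled-up dimension. This is only an assumption-laden sketch and not 
the code in the attached patches: the class name, method name, and dimension 
names are hypothetical, and the large-group partitioning annotations produced 
by the sampling job are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CubeLattice {
    /** Enumerates all 2^n group-by regions for the given dimensions. */
    static List<List<String>> regions(List<String> dims) {
        List<List<String>> out = new ArrayList<>();
        int n = dims.size();
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> region = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                // Keep dimension i if its bit is set; otherwise roll it up to "*".
                region.add((mask & (1 << i)) != 0 ? dims.get(i) : "*");
            }
            out.add(region);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical dimensions, chosen only for the example.
        for (List<String> r : regions(Arrays.asList("country", "state", "city"))) {
            System.out.println(r);
        }
    }
}
```

With three dimensions this yields 8 regions, from the fully rolled-up 
[*, *, *] to the finest-grained [country, state, city].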

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
