[ https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427036#comment-13427036 ]
Prasanth J commented on PIG-2831:
---------------------------------

Yes, I could do that. It looks like this will need a separate map job to read a few tuples? I think it will require tweaking the loader to emit a special tuple carrying the estimated number of records. I will try that once the end-to-end base implementation is up.

For now I am counting the sample size using RandomSampleLoader. I sample 1000 tuples per mapper and use those samples for the naive computation and for determining the partition size. However, RandomSampleLoader always returns more samples than expected. I am not sure whether that is a bug. Once the complete implementation is done we can look into a more accurate estimate of the number of tuples, etc. I will soon submit an intermediate patch for review.

Also, given the in-memory size of a tuple, how can we estimate the number of tuples that a reducer can handle without spilling to disk?

> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>
> Implementing distributed cube materialization on holistic measure based on
> MR-Cube approach as described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine algebraic attribute (can be detected automatically for few
> cases, if automatic detection fails user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes naive cube algorithm
> and generates annotated cube lattice (contains large group partitioning
> information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using
> distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of
> actual cube materialization job
> 7) OOM exception handling
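
To make the sampling step (3) and the reducer-capacity question from the comment more concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not the code in the patch and not necessarily the exact procedure from the MR-Cube paper. The function names (reducer_capacity, cube_regions, annotate_lattice) and parameters (memory_budget_bytes, total_rows) are hypothetical. It assumes the total row count is known or estimated separately, that the sample is uniform, and that a reducer's capacity is simply its memory budget divided by the in-memory tuple size, ignoring any container overhead.

```python
import math
from collections import Counter
from itertools import combinations


def reducer_capacity(memory_budget_bytes, tuple_size_bytes):
    """Rough number of tuples that fit in the budget (assumption: no overhead)."""
    if tuple_size_bytes <= 0:
        raise ValueError("tuple_size_bytes must be positive")
    return memory_budget_bytes // tuple_size_bytes


def cube_regions(dimensions):
    """Yield every subset of the dimension names, i.e. every cube region."""
    for k in range(len(dimensions) + 1):
        for region in combinations(dimensions, k):
            yield region


def annotate_lattice(sample, dimensions, total_rows, capacity):
    """Map each region to a partition factor derived from the sample.

    For each region, the largest group seen in the sample is scaled up to the
    full dataset. The partition factor is the number of reducer-sized pieces
    needed to hold that estimated group (at least 1).
    """
    if not sample:
        raise ValueError("sample must not be empty")
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    scale = total_rows / len(sample)  # rows represented by one sampled row
    lattice = {}
    for region in cube_regions(dimensions):
        # Naive cube on the sample: count rows per group in this region.
        counts = Counter(tuple(row[d] for d in region) for row in sample)
        estimated_largest = max(counts.values()) * scale
        lattice[region] = max(1, math.ceil(estimated_largest / capacity))
    return lattice
```

For example, with a 4-row sample standing in for 4000 rows and a capacity of 1000 tuples per reducer, the empty region (the grand total) would get a partition factor of 4, while a region whose groups each appear once in the sample would get 1. Regions with a factor above 1 are the ones that would need their large groups split across reducers and recombined in the post-process job (step 6).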