[ https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437077#comment-13437077 ]
Prasanth J commented on PIG-2831:
---------------------------------

Yes. It's true that skewed join and order by force the data to be written to disk in a map-only job and then use PoissonSampleLoader and RandomSampleLoader, respectively. PoissonSampleLoader loads n tuples from the dataset based on the join key distribution and appends a special tuple at the end recording the number of tuples loaded. RandomSampleLoader, by contrast, just loads 100 tuples from each mapper.

PoissonSampleLoader is definitely not applicable to our case. RandomSampleLoader can be used, but we would need to specify how many samples to load per mapper based on the overall data size. I think this method would also be unreliable because it may lead to oversampling or undersampling, and we would need to know the number of mappers before specifying the number of samples per mapper. One more disadvantage of this approach is the cost of the extra map-only job, which will be very expensive if the data size is large. I also noticed that after the dataset is forcibly copied to disk, the overall size increases because of the InterStorage format.

Performance-wise, I found the current approach of using the SAMPLE operator to be much faster. The entire sample extraction completes within a few minutes (1 min 23 s for ~100K samples from 100M tuples). It also doesn't cost an additional map job, and it saves space.

I like the idea of using the LoadMetadata approach, but until we have the HCatalog work integrated we may not be able to use it.

> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch, PIG-2831.3.git.patch
>
>
> Implementing distributed cube materialization on holistic measure based on
> MR-Cube approach as described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine algebraic attribute (can be detected automatically for few
>    cases, if automatic detection fails user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes naive cube algorithm
>    and generates annotated cube lattice (contains large group partitioning
>    information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using
>    distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of
>    actual cube materialization job
> 7) OOM exception handling
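
To make step 3 of the quoted description more concrete, together with the SAMPLE-based sampling discussed in the comment, here is a minimal Python sketch of the idea. It is not Pig code and is not taken from the attached patches. The function names (`cube_regions`, `bernoulli_sample`, `annotate_lattice`) and the `max_group_size` threshold are hypothetical, and the "largest group exceeds a threshold" rule is an assumed stand-in for whatever criterion the actual implementation uses. The sketch draws a per-tuple probability sample, runs a naive cube over the sample, and marks each lattice region whose largest group looks too big for a single reducer.

```python
import random
from collections import Counter
from itertools import combinations


def cube_regions(dims):
    """Return every subset of the dimension names (the 2^d cube lattice regions)."""
    return [r for k in range(len(dims) + 1) for r in combinations(dims, k)]


def bernoulli_sample(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction`.

    No mapper count or total row count is needed up front, unlike a
    fixed number of samples per mapper.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def annotate_lattice(sample, dims, max_group_size):
    """Naive cube over the sample; flag regions that contain a large group.

    `max_group_size` is a hypothetical threshold on the number of sampled
    tuples that fall into one group.
    """
    regions = cube_regions(dims)
    counts = Counter()
    for row in sample:
        # Naive cubing: every tuple contributes to one group in every region.
        for region in regions:
            key = tuple(row[d] for d in region)
            counts[(region, key)] += 1

    largest = {region: 0 for region in regions}
    for (region, _key), n in counts.items():
        largest[region] = max(largest[region], n)

    # True means the region has at least one group that needs partitioning.
    return {region: size > max_group_size for region, size in largest.items()}
```

With a sampling fraction of 0.001, the expected sample from 100M tuples is about 100K, which matches the sample size mentioned in the comment. The resulting dictionary plays the role of the annotated lattice that step 4 would ship to the mappers.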