[ 
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437077#comment-13437077
 ] 

Prasanth J commented on PIG-2831:
---------------------------------

Yes. It's true that skewed join and order by force the data to be written to 
disk in a map-only job and then use PoissonSampleLoader and RandomSampleLoader, 
respectively. PoissonSampleLoader loads n tuples from the dataset based on the 
join key distribution and appends a special tuple at the end carrying the 
number of tuples loaded. RandomSampleLoader, by contrast, simply loads 100 
tuples from each mapper. PoissonSampleLoader is definitely not applicable to 
our case. RandomSampleLoader could be used, but we would need to specify how 
many samples to load per mapper based on the overall data size. I think this 
method would also be unreliable because it may lead to oversampling or 
undersampling, and we would need to know the number of mappers before 
specifying the number of samples per mapper. Another disadvantage of this 
approach is the cost of the extra map-only job, which will be very expensive 
if the dataset is large. I also noticed that after the dataset is forcibly 
copied to disk, its overall size increases because of the InterStorage format. 
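
As a rough illustration of the per-mapper sizing problem, here is a small 
sketch. It is not Pig code: the function names and the split sizes are 
hypothetical, and it only models the arithmetic of a fixed per-mapper quota.

```python
import math

def per_mapper_quota(target_samples, num_mappers):
    """Samples each mapper must load so the total reaches target_samples."""
    return math.ceil(target_samples / num_mappers)

def total_loaded(tuples_per_mapper, quota):
    """Tuples actually sampled: a mapper cannot load more than its split holds."""
    return sum(min(n, quota) for n in tuples_per_mapper)

# Hypothetical split sizes (tuples per mapper), skewed on purpose.
splits = [50_000, 50_000, 50_000, 40]

# Quota derived from a 1,000-sample target; requires knowing the mapper count.
quota = per_mapper_quota(1_000, len(splits))
loaded = total_loaded(splits, quota)

# Fixed quota of 100 per mapper, independent of the data size.
fixed = total_loaded(splits, 100)

# Rounding the quota up overshoots the target when splits are large enough.
overshoot = per_mapper_quota(1_000, 3) * 3
```

With these made-up numbers the derived quota undershoots the target because 
one split is small, the fixed quota ignores the data size entirely, and 
rounding the quota up overshoots when the target does not divide evenly.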

Performance-wise, I found the current approach of using the SAMPLE operator to 
be much faster. The entire sample extraction completes within a few minutes 
(1 min 23 s for ~100K samples from 100M tuples). It also avoids the cost of an 
additional map job and saves space. 
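
For reference, the sampling fraction implied by those numbers can be worked 
out as below. This is only a sketch of per-tuple probabilistic sampling under 
the assumption that each tuple is kept independently with probability p; the 
function names are hypothetical and this is not the Pig implementation.

```python
import random

def sample_fraction(target_samples, total_tuples):
    """Probability with which each tuple should be kept to hit the target on average."""
    return target_samples / total_tuples

def bernoulli_sample(tuples, p, seed=None):
    """Keep each tuple independently with probability p, in a single pass."""
    rng = random.Random(seed)
    return [t for t in tuples if rng.random() <= p]

# ~100K samples out of 100M tuples gives a fraction of 0.001.
p = sample_fraction(100_000, 100_000_000)

# Scaled-down run: 1M tuples should yield roughly 1,000 samples on average.
sample = bernoulli_sample(range(1_000_000), p, seed=42)
```

The sample size is only approximate, but no mapper count or extra pass over 
the data is needed to choose p.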

I like the idea of using the LoadMetadata approach, but until the HCatalog 
work is integrated we may not be able to use it. 
                
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch, 
> PIG-2831.3.git.patch
>
>
> Implementing distributed cube materialization on holistic measure based on 
> MR-Cube approach as described in http://arnab.org/files/mrcube.pdf. 
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine algebraic attribute (can be detected automatically for few 
> cases, if automatic detection fails user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes naive cube algorithm 
> and generates annotated cube lattice (contains large group partitioning 
> information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using 
> distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of 
> actual cube materialization job
> 7) OOM exception handling


