[ 
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436998#comment-13436998
 ] 

Dmitriy V. Ryaboy commented on PIG-2831:
----------------------------------------

Prasanth, I'll go through this in more detail, but the biggest issue I see is
the one you pointed out: the code for figuring out tuple size. Not only is
reading a single tuple an unreliable method, it is also not generally
applicable, and we really don't want to tie this whole thing to PigStorage.

The reason you are getting the raw tuple size is to estimate the total number
of tuples. One way to achieve this is to check whether the loader implements
LoadMetadata and, if it does, try to get the number of tuples from the stats it
provides. That should be the primary method of making this estimate, as it
will allow individual storage implementations to supply their own method for
this approximation, and give us all the benefits of the HCatalog work when that
comes around.
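
A minimal sketch of that "ask the loader first" check, for illustration only.
The types below (Loader, StatsProvider, Stats) are hypothetical stand-ins so
the example compiles on its own; they are not Pig's LoadFunc, LoadMetadata or
ResourceStatistics, and the fallback supplier stands in for whatever estimate
is used when the loader reports nothing.

```java
import java.util.Optional;
import java.util.function.LongSupplier;

public class TupleCountEstimate {

    /** Hypothetical stand-in for loader-provided statistics; null means "unknown". */
    static final class Stats {
        final Long numRecords;

        Stats(Long numRecords) {
            this.numRecords = numRecords;
        }
    }

    /** Hypothetical stand-in for a loader. */
    interface Loader { }

    /** Hypothetical stand-in for the optional metadata interface a loader may implement. */
    interface StatsProvider {
        Stats getStatistics(String location);
    }

    /** Returns the loader-reported tuple count, if the loader can supply one. */
    static Optional<Long> fromLoaderStats(Loader loader, String location) {
        if (!(loader instanceof StatsProvider)) {
            return Optional.empty();
        }
        Stats stats = ((StatsProvider) loader).getStatistics(location);
        if (stats == null || stats.numRecords == null) {
            return Optional.empty();
        }
        return Optional.of(stats.numRecords);
    }

    /** Prefers loader statistics; otherwise defers to a caller-supplied fallback. */
    static long estimate(Loader loader, String location, LongSupplier fallback) {
        Optional<Long> fromStats = fromLoaderStats(loader, location);
        return fromStats.isPresent() ? fromStats.get() : fallback.getAsLong();
    }
}
```

With this shape, a loader that knows its record count short-circuits any
sampling, and a loader such as PigStorage that reports nothing falls through to
the fallback.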

In the meantime, we still have a problem with PigStorage. How do SkewedJoin
and Order currently get around this problem? My understanding is that they
force the preceding data to be written to disk, then run a sampler job, and use
*memory* estimates to determine how many reducers the sampled keys need to go
to. Can we not use the same approach here?
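
To illustrate the kind of arithmetic a memory-based estimate implies, here is a
rough sketch. It is not Pig's sampler code; the method name, its parameters,
and the assumption that a sampler yields an average in-memory tuple size and an
estimated tuple count per key are all hypothetical.

```java
public class ReducerEstimate {

    /**
     * Number of reducers needed so that the tuples for one key fit within the
     * memory assumed to be usable on each reducer. Always returns at least 1.
     */
    static int reducersForKey(long estimatedTuples,
                              long avgTupleBytesInMemory,
                              long usableBytesPerReducer) {
        if (estimatedTuples < 0 || avgTupleBytesInMemory <= 0 || usableBytesPerReducer <= 0) {
            throw new IllegalArgumentException("sizes must be positive and count non-negative");
        }
        // How many tuples of this size fit on one reducer (at least one).
        long tuplesPerReducer = Math.max(1L, usableBytesPerReducer / avgTupleBytesInMemory);
        // Ceiling division without risking overflow on the addition.
        long reducers = estimatedTuples / tuplesPerReducer
                + (estimatedTuples % tuplesPerReducer == 0 ? 0 : 1);
        return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, reducers));
    }
}
```

For example, a key with an estimated 1,000,000 tuples of about 100 bytes each,
with 10,000,000 usable bytes per reducer, would be spread over 10 reducers.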


                
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch, 
> PIG-2831.3.git.patch
>
>
> Implementing distributed cube materialization for holistic measures, based on
> the MR-Cube approach described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine the algebraic attribute (can be detected automatically in a few
> cases; if automatic detection fails, the user should hint the algebraic
> attribute)
> 3) Modify MRPlan to insert a sampling job which executes the naive cube
> algorithm and generates an annotated cube lattice (contains large group
> partitioning information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using 
> distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of 
> actual cube materialization job
> 7) OOM exception handling


        
