[jira] [Commented] (PIG-4958) Tez autoparallelism estimation for order by is higher than mapreduce

Daniel Dai (JIRA) Fri, 05 Aug 2016 22:41:33 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410497#comment-15410497
 ]


Daniel Dai commented on PIG-4958:
---------------------------------

If we want to use a counter, why not get NUM_RECORDS from it as well? Then we 
can remove the NUMROWS_TUPLE_MARKER row and simplify the code.

Alternatively, we could write a GetDiskNumRows to use instead of GetMemNumRows 
and estimate the serialized size. Using a counter is more accurate, but with 
GetDiskNumRows we would not need DAGClientImpl or have to deal with the RM 
token. The DAGClientImpl + RM token approach sounds a little scary to me. How 
does that sound?
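
To make the size difference concrete, here is a rough sketch of the arithmetic, 
not the actual Pig code. It assumes the estimate is roughly 
ceil(inputBytes / bytesPerReducer) clamped to [1, maxReducers], and it uses 
1 GB per reducer and a cap of 999 as assumed values in the spirit of 
pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max. The class name, 
method name, and the 10 GB input are hypothetical, and the 4x factor comes from 
the issue description quoted below.

```java
class ReducerEstimateSketch {
    // Assumed values for illustration only (see the note above).
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    // Hypothetical helper: ceil(inputBytes / bytesPerReducer), clamped to [1, maxReducers].
    static int estimateReducers(long inputBytes, long bytesPerReducer, int maxReducers) {
        long reducers = (inputBytes + bytesPerReducer - 1) / bytesPerReducer;
        reducers = Math.max(1L, reducers);
        return (int) Math.min((long) maxReducers, reducers);
    }

    public static void main(String[] args) {
        long serializedBytes = 10_000_000_000L;   // hypothetical 10 GB on disk
        long inMemoryBytes = 4 * serializedBytes; // ~4x when measured in memory

        System.out.println("from serialized size: "
                + estimateReducers(serializedBytes, BYTES_PER_REDUCER, MAX_REDUCERS));
        System.out.println("from in-memory size:  "
                + estimateReducers(inMemoryBytes, BYTES_PER_REDUCER, MAX_REDUCERS));
    }
}
```

Under these assumptions the same data gets 10 reducers when sized by serialized 
bytes and 40 when sized by in-memory bytes, which is the gap either a counter 
or a GetDiskNumRows-style estimate would close.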

> Tez autoparallelism estimation for order by is higher than mapreduce
> --------------------------------------------------------------------
>
>                 Key: PIG-4958
>                 URL: https://issues.apache.org/jira/browse/PIG-4958
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4958-1.patch, PIG-4958-2.patch, 
> PIG-4958-withoutsecurity.patch
>
>
>   The input size is calculated from the size of the samples in memory. The 
> in-memory size is usually 4x the serialized size or more, whereas MapReduce 
> estimates the number of reducers based on the serialized size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
