[ https://issues.apache.org/jira/browse/PIG-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410497#comment-15410497 ]
Daniel Dai commented on PIG-4958:
---------------------------------

If we want to use a counter, why not get NUM_RECORDS as well? Then we can remove the NUMROWS_TUPLE_MARKER row and simplify the code.

On the other hand, we could also write a GetDiskNumRows instead of GetMemNumRows to estimate the serialized size. Though using a counter is more accurate, we don't need to use DAGClientImpl and deal with the RM token when using GetDiskNumRows. The DAGClientImpl + RM token approach sounds a little scary to me. How does that sound?

> Tez autoparallelism estimation for order by is higher than mapreduce
> --------------------------------------------------------------------
>
>                 Key: PIG-4958
>                 URL: https://issues.apache.org/jira/browse/PIG-4958
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4958-1.patch, PIG-4958-2.patch, PIG-4958-withoutsecurity.patch
>
>
> The input size is calculated from the size of the samples in memory. Size in memory is usually 4x or more than the serialized size. Mapreduce estimates the number of reducers based on serialized size.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
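
To make the effect described in the issue concrete, here is a rough sketch of how a size-based reducer estimate changes when it is fed the in-memory sample size instead of the serialized size. It is not the code in the attached patches. The class and method names are hypothetical, the two constants are assumed to mirror the usual defaults of pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max, and the 10 GB input and the 4x ratio are illustrative numbers only.

```java
// Hypothetical sketch; not taken from the PIG-4958 patches.
class ReducerEstimateSketch {
    // Assumed to mirror pig.exec.reducers.bytes.per.reducer (about 1 GB per reducer).
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    // Assumed to mirror pig.exec.reducers.max.
    static final int MAX_REDUCERS = 999;

    // Ceiling of inputBytes / BYTES_PER_REDUCER, clamped to [1, MAX_REDUCERS].
    static int estimateReducers(long inputBytes) {
        long n = inputBytes / BYTES_PER_REDUCER
                + (inputBytes % BYTES_PER_REDUCER == 0 ? 0 : 1);
        return (int) Math.max(1, Math.min(MAX_REDUCERS, n));
    }

    public static void main(String[] args) {
        long serializedBytes = 10_000_000_000L;    // illustrative: 10 GB on disk
        long inMemoryBytes = 4 * serializedBytes;  // the "4x" ratio from the description

        // Estimate from serialized size, as the description says mapreduce does.
        System.out.println("serialized: " + estimateReducers(serializedBytes));
        // Estimate from in-memory size, which inflates the parallelism.
        System.out.println("in-memory:  " + estimateReducers(inMemoryBytes));
    }
}
```

With these assumed numbers the serialized size gives 10 reducers and the in-memory size gives 40, which is the kind of gap a GetDiskNumRows-style estimate or a counter-based size would close.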