Rohini Palaniswamy created PIG-4775:
---------------------------------------

             Summary: Better default values for shuffle bytes per reducer
                 Key: PIG-4775
                 URL: https://issues.apache.org/jira/browse/PIG-4775
             Project: Pig
          Issue Type: Bug
            Reporter: Rohini Palaniswamy
            Assignee: Rohini Palaniswamy
             Fix For: 0.16.0


Currently the code does not set 
TEZ_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE if BYTES_PER_REDUCER_PARAM 
is not set or is equal to DEFAULT_BYTES_PER_REDUCER (1G). As a result, it 
defaults to 
TEZ_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE_DEFAULT = 1024*1024*100L 
(100MB), which is low and can produce more output files than usual. 
Removing that check and defaulting to 1G would be bad for performance: in 
mapreduce the value was based on map input size, but in Tez it is taken as 
map output size. So this sets 384MB as the default for group by, as it usually 
reduces the size of the output data, and keeps 256MB for joins, as they 
increase the size of the output data.

Did not touch order by and skewed join, as DEFAULT_BYTES_PER_REDUCER of 1G is 
honored there. Using 1G for them is similar to mapreduce, as map input and 
output sizes would be the same in those cases.
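
The selection described above could be sketched roughly as follows. This is 
only an illustration, not the actual patch: the class, enum, and method names 
are hypothetical, and only the sizes (100MB, 256MB, 384MB, 1G) come from the 
description. It assumes a value explicitly set by the user that differs from 
DEFAULT_BYTES_PER_REDUCER always wins.

```java
import java.util.OptionalLong;

// Hypothetical sketch; none of these names are taken from the Pig source.
public class ShuffleBytesPerReducerSketch {
    static final long MB = 1024L * 1024L;
    // Values quoted in the issue description.
    static final long DEFAULT_BYTES_PER_REDUCER = 1024L * MB; // 1G
    static final long GROUP_BY_DEFAULT = 384L * MB;           // group by shrinks data
    static final long JOIN_DEFAULT = 256L * MB;               // joins grow data

    // Hypothetical operator kinds, for illustration only.
    enum Op { GROUP_BY, JOIN, ORDER_BY, SKEWED_JOIN }

    // Returns the desired shuffle input size per reducer task, in bytes.
    static long desiredTaskInputSize(Op op, OptionalLong userBytesPerReducer) {
        // Assumption: an explicit, non-default user setting is used as-is.
        if (userBytesPerReducer.isPresent()
                && userBytesPerReducer.getAsLong() != DEFAULT_BYTES_PER_REDUCER) {
            return userBytesPerReducer.getAsLong();
        }
        switch (op) {
            case GROUP_BY:
                return GROUP_BY_DEFAULT;
            case JOIN:
                return JOIN_DEFAULT;
            default:
                // Order by and skewed join keep honoring the 1G default.
                return DEFAULT_BYTES_PER_REDUCER;
        }
    }

    public static void main(String[] args) {
        for (Op op : Op.values()) {
            long bytes = desiredTaskInputSize(op, OptionalLong.empty());
            System.out.println(op + ": " + (bytes / MB) + " MB");
        }
    }
}
```

With no user setting, the sketch yields 384MB for group by, 256MB for joins, 
and 1G for order by and skewed join, instead of falling through to the 100MB 
Tez default.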



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
