Rohini Palaniswamy created PIG-4775:
---------------------------------------
Summary: Better default values for shuffle bytes per reducer
Key: PIG-4775
URL: https://issues.apache.org/jira/browse/PIG-4775
Project: Pig
Issue Type: Bug
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
Fix For: 0.16.0
Currently the code does not set
TEZ_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE if BYTES_PER_REDUCER_PARAM
is not set or is equal to DEFAULT_BYTES_PER_REDUCER (1G). It therefore falls
back to TEZ_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE_DEFAULT =
1024*1024*100L (100MB), which is low and can result in more output files than
usual.
Removing that check and defaulting to 1G would be bad for performance: in
mapreduce the value is based on map input size, but in Tez it is taken as map
output size. So this sets 384MB as the default for group by, as it usually
reduces the size of the output data, and keeps 256MB for joins, as they
increase the size of the output data.
Did not touch order by and skewed join, as DEFAULT_BYTES_PER_REDUCER of 1G is
honored there. Using 1G for them is similar to mapreduce, as map input and
output sizes are the same in those cases.
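
The selection rule described above can be sketched roughly as follows. This is
only an illustration, not the actual patch: the class ShuffleSizeDefaults, the
enum OpKind and the method desiredTaskInputSize are hypothetical names, 1G is
taken here as 1024MB, and applying an explicit user value to every operator
type is an assumption.

```java
// Hypothetical sketch of the per-operator defaults; not code from Pig itself.
class ShuffleSizeDefaults {
    enum OpKind { GROUP_BY, JOIN, ORDER_BY, SKEWED_JOIN }

    static final long MB = 1024L * 1024L;

    // Assumption: "1G" from the description is taken as 1024MB here.
    static final long DEFAULT_BYTES_PER_REDUCER = 1024L * MB;

    // Defaults named in the description.
    static final long GROUP_BY_DEFAULT = 384L * MB;
    static final long JOIN_DEFAULT = 256L * MB;

    /**
     * Returns the desired task input size in bytes.
     *
     * @param kind            operator type feeding the shuffle
     * @param bytesPerReducer user-supplied bytes per reducer, or null if unset
     */
    static long desiredTaskInputSize(OpKind kind, Long bytesPerReducer) {
        // An explicit, non-default user value wins (assumed for all operators).
        if (bytesPerReducer != null && bytesPerReducer != DEFAULT_BYTES_PER_REDUCER) {
            return bytesPerReducer;
        }
        switch (kind) {
            case GROUP_BY:
                return GROUP_BY_DEFAULT;          // output usually shrinks
            case JOIN:
                return JOIN_DEFAULT;              // output usually grows
            default:
                return DEFAULT_BYTES_PER_REDUCER; // order by, skewed join
        }
    }

    public static void main(String[] args) {
        for (OpKind kind : OpKind.values()) {
            System.out.println(kind + " -> " + desiredTaskInputSize(kind, null));
        }
    }
}
```

With no user value, a group by resolves to 384MB and a join to 256MB instead of
the 100MB Tez default, while order by and skewed join stay at 1G.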