[jira] [Commented] (PIG-4775) Better default values for shuffle bytes per reducer

Daniel Dai (JIRA) Sun, 10 Jan 2016 18:35:54 -0800

    [ https://issues.apache.org/jira/browse/PIG-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091345#comment-15091345 ]


Daniel Dai commented on PIG-4775:
---------------------------------

384 * 1024 * 1024L and 256 * 1024 * 1024L should be predefined constants 
(intermediateTaskInputSize too, but that's not part of the patch). Otherwise +1.
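A minimal sketch of what such constants could look like. The class and constant names here are hypothetical and are not taken from PIG-4775-1.patch; only the two values come from the comment.

```java
// Hypothetical holder class; names are illustrative, not from the patch.
final class ShuffleSizeDefaults {
    private static final long MB = 1024 * 1024L;

    // 384 * 1024 * 1024L, the value proposed for group by.
    static final long GROUP_BY_DESIRED_TASK_INPUT_SIZE = 384 * MB;

    // 256 * 1024 * 1024L, the value proposed for joins.
    static final long JOIN_DESIRED_TASK_INPUT_SIZE = 256 * MB;

    private ShuffleSizeDefaults() {
        // Constants only; not meant to be instantiated.
    }
}
```

Naming the values once keeps the long arithmetic (the L suffix) in a single place instead of repeating it at each call site.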

> Better default values for shuffle bytes per reducer
> ---------------------------------------------------
>
>                 Key: PIG-4775
>                 URL: https://issues.apache.org/jira/browse/PIG-4775
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4775-1.patch
>
>
> Currently the code does not set 
> TEZ_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE if BYTES_PER_REDUCER_PARAM 
> is not set or is equal to DEFAULT_BYTES_PER_REDUCER (1G). It therefore defaults 
> to TEZ_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE_DEFAULT = 
> 1024*1024*100L (100MB), which is low and can produce more output files than 
> usual. Removing that check and defaulting to 1G would be bad for performance: 
> in mapreduce the value was based on map input size, but in Tez it is taken as 
> map output size. So this sets 384MB as the default for group by, as group bys 
> usually reduce the size of the output data, and keeps 256MB for joins, as 
> joins increase the size of the output data.
> Did not touch order by and skewed join, as DEFAULT_BYTES_PER_REDUCER of 1G is 
> honored there. Using 1G for them would be similar to mapreduce, as map input 
> and output would be the same for those cases. 
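The selection described in the issue can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: the class, enum, and method names are hypothetical, and it assumes that a user-supplied bytes-per-reducer value that differs from the 1G default is passed through unchanged.

```java
// Hypothetical sketch; not the actual Pig-on-Tez implementation.
final class DesiredTaskInputSize {
    // Operator kinds named in the issue description.
    enum Op { GROUP_BY, JOIN, ORDER_BY, SKEWED_JOIN }

    static final long MB = 1024 * 1024L;
    static final long DEFAULT_BYTES_PER_REDUCER = 1024 * MB; // 1G

    // bytesPerReducer is null when the user did not set the parameter.
    static long choose(Op op, Long bytesPerReducer) {
        boolean userSet = bytesPerReducer != null
                && bytesPerReducer.longValue() != DEFAULT_BYTES_PER_REDUCER;
        if (userSet) {
            // Assumption: an explicit user value is honored as-is.
            return bytesPerReducer;
        }
        switch (op) {
            case GROUP_BY:
                return 384 * MB; // group by usually shrinks the data
            case JOIN:
                return 256 * MB; // joins usually grow the data
            default:
                // Order by and skewed join keep the 1G default.
                return DEFAULT_BYTES_PER_REDUCER;
        }
    }

    private DesiredTaskInputSize() {}
}
```

For example, `DesiredTaskInputSize.choose(DesiredTaskInputSize.Op.GROUP_BY, null)` yields 384MB in this sketch, while an explicit 512MB setting would be returned unchanged.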



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
