[ 
https://issues.apache.org/jira/browse/TEZ-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054099#comment-17054099
 ] 

László Bodor commented on TEZ-4130:
-----------------------------------

I agree, [~jeagles]: it was confusing. The original issue in Hive was about 
too many splits, which affected the number of output files written to the 
scratch dir; it had nothing to do with the shuffle vertex manager's parallelism 
parameter. What we're trying to achieve here is a convenient way to limit the 
number of input splits (after grouping), and I think that can be achieved along 
the lines [~belugabehr] proposed above. In the log below, grouping leaves all 
4800 original splits in place:

{code}
2020-02-16 13:42:40,546 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: The preferred split size is 16777216
...
2020-02-16 13:42:41,680 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of input splits: 4800. 22 available slots, 1.7 waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat
...
2020-02-16 13:42:41,870 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired splits: 37 too small.  Desired splitLength: 33272755689 Max splitLength: 268435456 New desired splits: 4587 Total length: 1231091960503 Original splits: 4800
2020-02-16 13:42:41,885 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired numSplits: 4587 lengthPerGroup: 268387172 numLocations: 49 numSplitsPerLocation: 97 numSplitsInGroup: 1 totalLength: 1231091960503 numOriginalSplits: 4800 . Grouping by length: true count: false nodeLocalOnly: false
2020-02-16 13:42:41,902 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Number of splits desired: 4587 created: 4800 splitsProcessed: 4800
2020-02-16 13:42:41,907 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original split count is 4800 grouped split count is 4800, for bucket: 1
2020-02-16 13:42:41,910 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of split groups: 4800
{code}
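
The numbers in this log can be reproduced with a few lines of arithmetic, which 
may help explain why grouping did not reduce the split count. The following is 
only a sketch reconstructed from the logged values, not the actual 
TezSplitGrouper code: the function names are hypothetical, and the formulas 
(slots times waves for the initial target, total length divided by the max 
split length plus one for the adjusted target) are assumptions that happen to 
match what was logged.

```python
# Values copied from the log above.
TOTAL_LENGTH = 1_231_091_960_503   # "Total length"
ORIGINAL_SPLITS = 4800             # "Original splits"
AVAILABLE_SLOTS = 22               # "22 available slots"
WAVES = 1.7                        # "1.7 waves"
MAX_SPLIT_LENGTH = 268_435_456     # "Max splitLength" (256 MiB)


def initial_desired_splits(available_slots: int, waves: float) -> int:
    """Assumed initial target: one split per slot per wave, truncated."""
    return int(available_slots * waves)


def clamp_to_max_length(total_length: int, desired: int, max_length: int) -> int:
    """Assumed adjustment: if the per-group length would exceed the maximum,
    raise the number of desired splits so each group fits under it."""
    length_per_group = total_length // desired
    if length_per_group > max_length:
        return total_length // max_length + 1
    return desired


desired = initial_desired_splits(AVAILABLE_SLOTS, WAVES)
new_desired = clamp_to_max_length(TOTAL_LENGTH, desired, MAX_SPLIT_LENGTH)
length_per_group = TOTAL_LENGTH // new_desired
splits_in_group = ORIGINAL_SPLITS // new_desired
avg_original_split = TOTAL_LENGTH // ORIGINAL_SPLITS

print("desired splits:", desired)
print("desired splitLength:", TOTAL_LENGTH // desired)
print("new desired splits:", new_desired)
print("lengthPerGroup:", length_per_group)
print("numSplitsInGroup:", splits_in_group)
print("average original split length:", avg_original_split)
```

Under these assumptions the average original split is already about 245 MiB, 
so two of them cannot fit into one group of at most 256 MiB. That would be 
consistent with numSplitsInGroup: 1 and with 4800 groups being created even 
though 4587 were desired.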


> Config for max task parallelism in shuffle - 
> tez.shuffle-vertex-manager.max-task-parallelism
> --------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4130
>                 URL: https://issues.apache.org/jira/browse/TEZ-4130
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>
> During the investigation of a customer issue, I found that Tez generated a 
> DAG plan containing >4k tasks. It failed for Hive because of bucket number 
> limitations (4k). This can be avoided with proper configuration, e.g. bigger 
> splits (tez.grouping.min-size), but maybe it would be more convenient for 
> users to configure a hard limit for the shuffle vertex manager.
> However, I'm not really sure if it's correct to force changing the max task 
> parallelism after split generation already happened (e.g. 
> [HiveSplitGenerator|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L192-L244]):
> https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java#L477



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
