[ 
https://issues.apache.org/jira/browse/FLINK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931611#comment-16931611
 ] 

Elias Levy commented on FLINK-12122:
------------------------------------

Till, that is a welcome development.  I am surprised this issue has languished 
since the 1.5 days.  It makes it very difficult to run certain jobs in 
standalone clusters that are over-allocated to handle failover in case of TM 
node failure.  The uneven allocation of tasks results in Kafka consumer lag for 
a subset of partitions under many workloads.  We've had to modify our clusters 
to exactly match parallelism and number of slots, and use other mechanisms to 
handle failover when upgrading old jobs to 1.9.

> Spread out tasks evenly across all available registered TaskManagers
> --------------------------------------------------------------------
>
>                 Key: FLINK-12122
>                 URL: https://issues.apache.org/jira/browse/FLINK-12122
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Major
>         Attachments: image-2019-05-21-12-28-29-538.png, 
> image-2019-05-21-13-02-50-251.png
>
>
> With Flip-6, we changed the default behaviour of how slots are assigned to 
> {{TaskManagers}}. Instead of evenly spreading them out over all registered 
> {{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a 
> tendency to first fill up a TM before using another one. This is a regression 
> wrt the pre Flip-6 code.
> I suggest changing the behaviour so that we try to evenly distribute slots 
> across all available {{TaskManagers}} by considering how many of their slots 
> are already allocated.
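
A minimal sketch of the suggested behaviour, assuming each TaskManager can 
report its total and allocated slot counts.  The names EvenSlotSpread, 
TaskManagerSlots, pickLeastUtilized and allocate are hypothetical and are not 
Flink's actual classes or API; the sketch only illustrates choosing the least 
utilized TaskManager for each new slot request.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class EvenSlotSpread {

    // Hypothetical view of a TaskManager's slots (not a Flink class).
    static final class TaskManagerSlots {
        final String id;
        final int totalSlots;
        int allocatedSlots;

        TaskManagerSlots(String id, int totalSlots) {
            this.id = id;
            this.totalSlots = totalSlots;
        }

        boolean hasFreeSlot() {
            return allocatedSlots < totalSlots;
        }

        // Fraction of this TaskManager's slots that are already allocated.
        double utilization() {
            return (double) allocatedSlots / totalSlots;
        }
    }

    // Pick the TaskManager with a free slot and the lowest utilization.
    static Optional<TaskManagerSlots> pickLeastUtilized(List<TaskManagerSlots> tms) {
        return tms.stream()
                .filter(TaskManagerSlots::hasFreeSlot)
                .min(Comparator.comparingDouble(TaskManagerSlots::utilization));
    }

    // Allocate one slot; returns false if no TaskManager has a free slot.
    static boolean allocate(List<TaskManagerSlots> tms) {
        Optional<TaskManagerSlots> target = pickLeastUtilized(tms);
        target.ifPresent(tm -> tm.allocatedSlots++);
        return target.isPresent();
    }
}
```

Under these assumptions, six requests against three TaskManagers with four 
slots each leave two slots allocated on every TaskManager, rather than 
filling one TaskManager before using the next.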



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
