[ https://issues.apache.org/jira/browse/FLINK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559773#comment-16559773 ]
Piotr Nowojski commented on FLINK-9969:
---------------------------------------
Another remark: this might also be related to how Flip6 lazily requests new
task managers. In {{mode: old}}, tasks are uniformly distributed among all
nodes, while in Flip6 tasks are squeezed mostly onto one node. The timeline of
events is as follows:
12:49:13 - requested new TaskManager
12:49:20 - first TaskManager registered
12:49:20 - all tasks moved from `CREATED` to `SCHEDULED`; they start being
           executed in batches of 4 (because of 4 slots) on this TM
12:49:20 - requested more TaskManagers
12:49:21 - first TM fails with OOM after processing ~18 tasks
12:49:27 - more TMs registered
Please check attached [^yarn_logs] for details.
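
To make the difference between the two placements concrete, here is a minimal sketch. It is not Flink's scheduler: the helpers {{distribute}} and {{waves}} are hypothetical, and the model simply assumes that tasks are spread round-robin over whichever TaskManagers are registered at scheduling time. The numbers come from the command in this issue (-p 20, -yn 5, -ys 4).

```python
import math

PARALLELISM = 20   # -p 20
NUM_TMS = 5        # -yn 5
SLOTS_PER_TM = 4   # -ys 4


def distribute(num_tasks, registered_tms):
    """Hypothetical model: spread tasks round-robin over the registered TMs."""
    counts = [0] * registered_tms
    for task in range(num_tasks):
        counts[task % registered_tms] += 1
    return counts


def waves(tasks_on_tm, slots_per_tm):
    """Number of sequential batches one TM needs to run its tasks."""
    return math.ceil(tasks_on_tm / slots_per_tm)


# Assumed model of `mode: old`: all 5 TMs are available when tasks are scheduled.
uniform = distribute(PARALLELISM, NUM_TMS)

# Assumed model of the Flip6 timeline: only the first TM is registered
# when all tasks move to SCHEDULED.
squeezed = distribute(PARALLELISM, 1)

print("uniform :", uniform, "waves per TM:", waves(max(uniform), SLOTS_PER_TM))
print("squeezed:", squeezed, "waves on first TM:", waves(max(squeezed), SLOTS_PER_TM))
```

Under these assumptions the single registered TM would have to run all 20 tasks in 5 batches of 4, while with all TMs available each TM runs 4 tasks in a single batch. The real run differs in detail (the first TM failed after ~18 tasks), so treat this only as an illustration of the skew.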
> Unreasonable memory requirements to complete examples/batch/WordCount
> ---------------------------------------------------------------------
>
> Key: FLINK-9969
> URL: https://issues.apache.org/jira/browse/FLINK-9969
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0
> Reporter: Piotr Nowojski
> Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
> Attachments: yarn_logs
>
>
> Setup on AWS EMR:
> * 5 worker nodes (m4.4xlarge nodes)
> * 1 master node (m4.large)
> The following command fails with out-of-memory errors:
> {noformat}
> export HADOOP_CLASSPATH=`hadoop classpath`
> ./bin/flink run -m yarn-cluster -p 20 -yn 5 -ys 4 -ytm 16000 examples/batch/WordCount.jar
> {noformat}
> The example completes only after increasing the memory above 17.2GB. At the
> same time, after disabling flip6, the following command succeeds:
> {noformat}
> export HADOOP_CLASSPATH=`hadoop classpath`
> ./bin/flink run -m yarn-cluster -p 20 -yn 5 -ys 4 -ytm 1000 examples/batch/WordCount.jar
> {noformat}