Xintong Song created FLINK-13555:
------------------------------------
Summary: Failures of slot requests requiring unfulfillable managed
memory should not be ignored.
Key: FLINK-13555
URL: https://issues.apache.org/jira/browse/FLINK-13555
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Xintong Song
Fix For: 1.9.0
Attachments: flink-unk-standalonesession-0-u-home.log,
flink-unk-taskexecutor-0-u-home.log
Currently, SlotPool ignores failures to request slots from the ResourceManager
for all batch slot requests. The idea is to let batch slot requests stay
pending in SlotPool while they wait for other tasks to finish and release
their slots. A slot request fails only if it is not fulfilled within its timeout.
However, a slot request to the RM can fail for two different reasons:
# The RM has no available slots. All slots are in use at the moment, but
they might become available later when the currently running tasks finish.
# The slot request requires more resources than any slot in the cluster
(available or not) can provide. Such a request is unlikely to be fulfilled
later either.
For the second kind of failure, it doesn't make sense to wait for the timeout.
We should fail the job immediately, with a proper error message that describes
the problem and suggests that the user tune the job or cluster configuration.
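
The distinction between the two failure kinds could be sketched roughly as
follows. This is only an illustration of the proposed behavior, not Flink's
actual SlotPool or ResourceManager code: the class, the enum, and both methods
are hypothetical, and a slot is reduced to a single managed memory size in MB.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in Flink.
public class SlotRequestFailureSketch {

    /** The two kinds of failed slot requests described in this issue. */
    enum FailureKind {
        NO_SLOT_CURRENTLY_AVAILABLE, // may succeed once running tasks release slots
        UNFULFILLABLE                // no slot in the cluster is large enough
    }

    /**
     * Classifies a failed request by comparing the requested managed memory
     * against every slot in the cluster, whether it is in use or not.
     * Assumption: an empty slot list is treated as unfulfillable.
     */
    static FailureKind classify(long requestedManagedMemoryMb, List<Long> slotManagedMemoryMb) {
        for (long slotMemoryMb : slotManagedMemoryMb) {
            if (slotMemoryMb >= requestedManagedMemoryMb) {
                return FailureKind.NO_SLOT_CURRENTLY_AVAILABLE;
            }
        }
        return FailureKind.UNFULFILLABLE;
    }

    /** Keeps a batch request pending for the first kind, fails fast for the second. */
    static void onBatchSlotRequestFailure(long requestedManagedMemoryMb, List<Long> slotManagedMemoryMb) {
        if (classify(requestedManagedMemoryMb, slotManagedMemoryMb) == FailureKind.UNFULFILLABLE) {
            throw new IllegalStateException(
                "Slot request needs " + requestedManagedMemoryMb
                    + " MB of managed memory, which no slot in the cluster can provide."
                    + " Consider tuning the job or cluster configuration.");
        }
        // Otherwise do nothing: the request stays pending until it is
        // fulfilled or its timeout expires.
    }

    public static void main(String[] args) {
        List<Long> slots = Arrays.asList(256L, 512L);
        System.out.println(classify(512L, slots));  // prints NO_SLOT_CURRENTLY_AVAILABLE
        System.out.println(classify(1024L, slots)); // prints UNFULFILLABLE
    }
}
```

In this sketch only the second kind surfaces as an exception, so batch requests
that are merely waiting for slots to be released keep their current behavior.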