Zhu Zhu created FLINK-12138:
-------------------------------
Summary: Limit input split count of each source task for better
failover experience
Key: FLINK-12138
URL: https://issues.apache.org/jira/browse/FLINK-12138
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Zhu Zhu
Assignee: Zhu Zhu
Flink currently uses an InputSplitAssigner to dynamically assign input splits to
source tasks. A task requests a new split after it finishes processing the
previous one, thus achieving better load balance.
However, when there are fewer slots than source tasks, only the first launched
source tasks can request splits, and this lasts until all the splits are
consumed. This is not failover friendly, as users sometimes intentionally set a
larger parallelism to reduce the failover impact.
For example, suppose a job runs in a 10-slot session and has a source vertex of
parallelism 1000 consuming 10000 splits, with no vertices connected to others.
Currently, only 10 of the 1000 source tasks will be launched, and they will only
finish after all the input splits are consumed. If a task fails, up to ~1000
splits (10000 splits / 10 running tasks) need to be re-processed. If all 1000
tasks could run in turn, only ~10 splits would need to be re-processed.
We propose adding a cap on the number of input splits each source task may
process. Once the cap is reached, the task cannot get any more splits from the
InputSplitAssigner and then finishes, so its slot is freed for other source
tasks.
Theoretically, a proper value for the cap would be max(input split size) /
avg(input split size).
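The capping behavior can be sketched as follows. This is a simplified,
illustrative stand-in in plain Java, not Flink's actual InputSplitAssigner
interface; the class and method names (CappedSplitAssigner, getNextInputSplit,
the integer split ids) are assumptions for the sketch:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Simplified stand-in for an input split assigner: hands out splits on
// request, but stops serving a task once it has received `cap` splits,
// so that task can finish and free its slot for other source tasks.
class CappedSplitAssigner {
    private final Queue<Integer> splits = new ArrayDeque<>();
    private final Map<Integer, Integer> assignedPerTask = new HashMap<>();
    private final int cap;

    CappedSplitAssigner(int totalSplits, int cap) {
        for (int i = 0; i < totalSplits; i++) {
            splits.add(i);
        }
        this.cap = cap;
    }

    /**
     * Returns the next split id for the given task, or null if the task has
     * reached the cap (or no splits remain) and should therefore finish.
     */
    Integer getNextInputSplit(int taskIndex) {
        int assigned = assignedPerTask.getOrDefault(taskIndex, 0);
        if (assigned >= cap || splits.isEmpty()) {
            return null; // task finishes; its slot becomes available
        }
        assignedPerTask.put(taskIndex, assigned + 1);
        return splits.poll();
    }
}
```

With 10000 splits and a cap of 10, each source task processes at most 10 splits
before finishing, so in a 10-slot session all 1000 source tasks eventually get
to run, and a single failure re-processes at most ~10 splits.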
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)