Hong Liang Teoh created FLINK-37417:
---------------------------------------
Summary: Implement better exception message to explain why
KinesisSource doesn't support partial recovery
Key: FLINK-37417
URL: https://issues.apache.org/jira/browse/FLINK-37417
Project: Flink
Issue Type: Improvement
Reporter: Hong Liang Teoh
The KinesisSource doesn't currently support partial recovery because we want to
prevent duplicate records being read into KinesisSource when restart + partial
failure happens.
Partial recovery is when Flink restarts a subset of subtasks to minimize
downtime during a job failure. Flink determines which subtasks need to be
restarted by checking all connected subtasks to the failed subtasks.
However, this algorithm doesn't work for a KDS source, because the
KinesisSource:
# Has parent-child shard ordering within the KDS stream.
# Does not ensure that all child shards are assigned to the same subtask (to
prevent skew)
These two mean that when we restart "selected" subtasks, we cannot ensure that
the child shards have not been assigned to other subtasks "not connected" to
the selected subtask.
This is very much an edge case, where users will have to be doing the following:
# NOT have a keyBy immediately after the KDS source.
# Partial failure happens on an operator BEFORE the first keyBy in the job
graph.
This JIRA is to enrich the Exception message thrown
--
This message was sent by Atlassian Jira
(v8.20.10#820010)