JIN SUN created FLINK-10288:
-------------------------------
Summary: Failover Strategy improvement
Key: FLINK-10288
URL: https://issues.apache.org/jira/browse/FLINK-10288
Project: Flink
Issue Type: Improvement
Components: JobManager
Reporter: JIN SUN
Assignee: JIN SUN
Flink invests significant effort in making streaming jobs fault tolerant. The
checkpoint mechanism and exactly-once semantics set Flink apart from other
systems. However, there are still some cases that are not handled well. These
cases apply to both streaming and batch scenarios, and they are orthogonal to
the current fault tolerance mechanism. Here is a summary of those cases:
# Some failures are non-recoverable, such as a user error like a
DivideByZeroException. We shouldn't try to restart the task, as it will never
succeed. DivideByZeroException is just a simple case; such errors are sometimes
not easy to reproduce or predict, as they might be triggered only by specific
input data. We shouldn't retry for all user-code exceptions.
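The classification above could be sketched as follows. This is a minimal illustration, not Flink's actual restart-strategy API; the class and method names (`FailureClassifier`, `classify`, `FailureType`) are hypothetical, and the choice of which exception types count as non-recoverable is an assumption for the example.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: decide whether a failure cause is worth retrying.
public class FailureClassifier {
    public enum FailureType { NON_RECOVERABLE, RECOVERABLE }

    public static FailureType classify(Throwable cause) {
        // Deterministic user-code errors (e.g. divide by zero, which Java
        // surfaces as ArithmeticException) will fail again on every retry.
        if (cause instanceof ArithmeticException
                || cause instanceof NullPointerException) {
            return FailureType.NON_RECOVERABLE;
        }
        // Transient environment issues are worth retrying.
        if (cause instanceof IOException || cause instanceof TimeoutException) {
            return FailureType.RECOVERABLE;
        }
        // Unknown causes: retry cautiously rather than fail the job outright.
        return FailureType.RECOVERABLE;
    }
}
```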
# There is no limit on task retries today: unless a SuppressRestartException
is encountered, a task keeps retrying until it succeeds. As mentioned above,
some failures shouldn't be retried at all; and for the exceptions we can
retry, such as a network exception, should there be a retry limit? We need
retries for transient issues, but we also need a limit to avoid infinite
retries and wasted resources. Batch and streaming workloads might need
different strategies.
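A bounded retry budget like the one argued for above could look like this. The class is a hypothetical sketch, not the actual Flink RestartStrategy interface; the name and the counting scheme are assumptions for illustration.

```java
// Hypothetical sketch: cap the number of restarts for recoverable failures
// so a task cannot retry forever and waste cluster resources.
public class BoundedRetryPolicy {
    private final int maxAttempts;
    private int attempts = 0;

    public BoundedRetryPolicy(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Returns true if one more restart is allowed; counts the attempt. */
    public boolean allowRestart() {
        return attempts++ < maxAttempts;
    }
}
```

A batch job might be configured with a small budget, while a long-running streaming job could use a larger budget that resets after a sustained healthy period.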
# Some exceptions are due to hardware issues, such as disk or network
malfunctions. When a task/TaskManager fails because of one, we'd better detect
it and avoid scheduling to that machine next time.
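One way to avoid repeatedly scheduling onto a bad machine is a simple failure-count blacklist, sketched below. This is an illustration only; `HostBlacklist` and its threshold-based rule are hypothetical, not part of Flink's scheduler.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: once a host accumulates enough hardware-related
// failures, stop considering it for task placement.
public class HostBlacklist {
    private final int threshold;
    private final Map<String, Integer> failures = new HashMap<>();

    public HostBlacklist(int threshold) {
        this.threshold = threshold;
    }

    /** Record one hardware-related failure observed on the given host. */
    public void reportFailure(String host) {
        failures.merge(host, 1, Integer::sum);
    }

    /** A host stays schedulable while its failure count is below the threshold. */
    public boolean isSchedulable(String host) {
        return failures.getOrDefault(host, 0) < threshold;
    }
}
```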
# If a task reads from a blocking result partition and its input is not
available, we can 'revoke' the producer task: mark it failed and rerun the
upstream task to regenerate the data. The revoke can propagate up through the
chain. In Spark, revocation is naturally supported by lineage.
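The propagation up the chain can be sketched as a walk over producer edges: whenever a producer's result is also gone, it too must be revoked and rerun. The `TaskNode`/`revoke` names and the graph representation are hypothetical, not Flink internals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: when a consumer finds its blocking input missing,
// revoke it and every upstream producer whose result is also unavailable.
public class TaskNode {
    final String name;
    final List<TaskNode> producers = new ArrayList<>();
    boolean resultAvailable = true;

    TaskNode(String name) {
        this.name = name;
    }

    /** Collect every task that must rerun, walking up the producer chain. */
    static void revoke(TaskNode task, List<String> toRerun) {
        toRerun.add(task.name);
        for (TaskNode producer : task.producers) {
            if (!producer.resultAvailable) {
                // The producer's result is gone too: propagate the revoke.
                revoke(producer, toRerun);
            }
        }
    }
}
```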
To make fault tolerance easier, we need to keep behavior as deterministic as
possible. For user code this is not easy to control; for system code, however,
we can fix it. For example, we should at least make sure that different
attempts of the same task get the same inputs (there is a bug in the current
codebase (DataSourceTask) that cannot guarantee this). Note that this is
tracked by FLINK-10205.
See this proposal for details:
[https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)