[
https://issues.apache.org/jira/browse/FLINK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann closed FLINK-10288.
---------------------------------
Resolution: Abandoned
Closed for inactivity.
> Failover Strategy improvement
> -----------------------------
>
> Key: FLINK-10288
> URL: https://issues.apache.org/jira/browse/FLINK-10288
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: JIN SUN
> Assignee: ryantaocer
> Priority: Major
>
> Flink puts significant effort into making streaming jobs fault tolerant. The
> checkpoint mechanism and exactly-once semantics set Flink apart from other
> systems. However, some cases are still not handled well. These cases apply to
> both streaming and batch scenarios, and they are orthogonal to the current
> fault tolerance mechanism. Here is a summary of those cases:
> # Some failures are non-recoverable, for example a user error such as a
> DivideByZeroException. We shouldn't restart such a task, as it will never
> succeed. DivideByZeroException is just a simple case; such errors are often
> hard to reproduce or predict because they may be triggered only by specific
> input data, so we shouldn't retry blindly for every user-code exception (see
> the classification sketch after this list).
> # There is no limit on task retries today: unless a SuppressRestartsException
> is encountered, a task keeps retrying until it succeeds. As mentioned above,
> some failures should not be retried at all. For the exceptions we can retry,
> such as network exceptions, should there be a retry limit? We need to retry on
> transient issues, but we also need a cap to avoid infinite retries and wasted
> resources (the sketch after this list includes such a cap). Batch and
> streaming workloads may need different strategies.
> # Some exceptions are caused by hardware issues, such as disk or network
> malfunction. When a task or TaskManager fails because of these, we had better
> detect it and avoid scheduling onto that machine next time (see the blacklist
> sketch after this list).
> # If a task reads from a blocking result partition and its input is not
> available, we can 'revoke' the producing task: mark it as failed and rerun the
> upstream task to regenerate the data. The revocation can propagate up the
> chain (see the sketch after this list). In Spark, revocation is naturally
> supported by lineage.
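>
> Below is a minimal sketch of the classification idea from items 1 and 2. The
> FailureClassifier class and its category list are illustrative assumptions,
> not existing Flink API; today a bounded retry count can already be configured
> via the existing RestartStrategies.fixedDelayRestart(attempts, delay), but
> that cap does not depend on the failure cause:
> {code:java}
> import java.util.Arrays;
> import java.util.List;
>
> /**
>  * Hypothetical classifier for items 1 and 2: decide whether a failure is
>  * worth retrying at all, and if so, cap the number of attempts. All names
>  * here are invented for illustration.
>  */
> public class FailureClassifier {
>
>     /** Failures that are deterministic user errors; restarting is pointless. */
>     private static final List<Class<? extends Throwable>> NON_RECOVERABLE =
>             Arrays.asList(ArithmeticException.class,   // e.g. divide by zero
>                           ClassCastException.class);
>
>     private final int maxAttempts;
>
>     public FailureClassifier(int maxAttempts) {
>         this.maxAttempts = maxAttempts;
>     }
>
>     /** Returns true if the task should be restarted after this failure. */
>     public boolean shouldRestart(Throwable cause, int attemptsSoFar) {
>         // Item 1: never retry deterministic user errors.
>         for (Class<? extends Throwable> type : NON_RECOVERABLE) {
>             if (type.isInstance(cause)) {
>                 return false;
>             }
>         }
>         // Item 2: retry transient failures, but bound the attempts.
>         return attemptsSoFar < maxAttempts;
>     }
> }
> {code}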
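>
> A possible shape for the machine blacklisting in item 3; no such component
> exists in Flink today, so all names are invented:
> {code:java}
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Map;
> import java.util.Set;
>
> /**
>  * Hypothetical blacklist: count suspected hardware failures per TaskManager
>  * host and stop scheduling onto hosts that exceed a threshold.
>  */
> public class HostBlacklist {
>
>     private final Map<String, Integer> failureCounts = new HashMap<>();
>     private final Set<String> blacklisted = new HashSet<>();
>     private final int threshold;
>
>     public HostBlacklist(int threshold) {
>         this.threshold = threshold;
>     }
>
>     /** Called when a task fails with a disk/network style error. */
>     public void reportEnvironmentFailure(String host) {
>         if (failureCounts.merge(host, 1, Integer::sum) >= threshold) {
>             blacklisted.add(host);
>         }
>     }
>
>     /** The scheduler would consult this before placing a task on a host. */
>     public boolean isSchedulable(String host) {
>         return !blacklisted.contains(host);
>     }
> }
> {code}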
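>
> And a sketch of the revocation in item 4, using simplified stand-in types
> rather than Flink's actual ExecutionGraph classes:
> {code:java}
> import java.util.List;
>
> /**
>  * Hypothetical revocation walk: when a consumer finds a blocking result
>  * partition missing, fail the producing task and keep walking upstream
>  * until tasks whose inputs are still available are reached.
>  */
> public class RevocationSketch {
>
>     interface Task {
>         boolean inputsAvailable();  // are all consumed partitions readable?
>         List<Task> producers();     // upstream tasks this task consumes from
>         void markFailedForRerun();  // schedule this task to run again
>     }
>
>     /** Revoke the producer of a missing partition, propagating upward. */
>     static void revoke(Task producer) {
>         // If the producer's own inputs are gone too, rerun its upstream
>         // first, mirroring how Spark recomputes partitions via lineage.
>         if (!producer.inputsAvailable()) {
>             for (Task upstream : producer.producers()) {
>                 revoke(upstream);
>             }
>         }
>         producer.markFailedForRerun();
>     }
> }
> {code}
>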
> To make fault tolerance easier, we need to keep behavior as deterministic as
> possible. For user code this is hard to control, but for system code we can
> fix it. For example, we should at least make sure that different attempts of
> the same task get the same inputs (the current codebase has a bug in
> DataSourceTask that cannot guarantee this). Note that this is tracked by
> FLINK-10205.
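>
> As an illustration of that determinism point: a restarted attempt should read
> exactly the splits of its predecessor. A minimal sketch, assuming a
> hypothetical precomputed round-robin assignment (not the actual
> DataSourceTask fix):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> /**
>  * Instead of handing out splits first-come-first-served (where a restarted
>  * attempt may see a different set), derive the assignment purely from the
>  * subtask index so every attempt of subtask i reads the same splits.
>  */
> public class DeterministicSplitAssignment {
>
>     /** Splits for subtask i: every element at an index congruent to i. */
>     static <S> List<S> splitsForSubtask(List<S> all, int subtask, int parallelism) {
>         List<S> assigned = new ArrayList<>();
>         for (int i = subtask; i < all.size(); i += parallelism) {
>             assigned.add(all.get(i));
>         }
>         return assigned;
>     }
> }
> {code}
>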
> See this proposal for details:
> [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)