JIN SUN created FLINK-10288:
-------------------------------

             Summary: Failover Strategy improvement
                 Key: FLINK-10288
                 URL: https://issues.apache.org/jira/browse/FLINK-10288
             Project: Flink
          Issue Type: Improvement
          Components: JobManager
            Reporter: JIN SUN
            Assignee: JIN SUN


Flink pays significant efforts to make Streaming Job fault tolerant. The 
checkpoint mechanism and exactly once semantics make Flink different than other 
systems. However, there are still some cases not been handled very well. Those 
cases can apply to both Streaming and Batch scenarios, and its orthogonal with 
current fault tolerant mechanism. Here is a summary of those cases:
 # Some failures are non-recoverable, such as a user error: 
DividebyZeroException. We shouldn't try to restart the task, as it will never 
succeed. The DivideByZeroException is just a simple case, those errors sometime 
are not easy to reproduce or predict, as it might be only triggered by specific 
input data, we shouldn’t retry for all user code exceptions.
 # There is no limit for task retry today, unless a SuppressRestartException 
was encountered, a task will keep on retrying until it succeeds. As mentioned 
above, we shouldn’t retry for some cases at all, and for the Exceptions we can 
retry, such as a network exception, should we have a retry limit? We need retry 
for any transient issue, but we also need to set a limit to avoid infinite 
retry and resource wasting. For Batch and Streaming workload, we might need 
different strategies.
 # There are some exceptions due to hardware issues, such as disk/network 
malfunction. when a task/TaskManager fail on this, we’d better detect and avoid 
to schedule to that machine next time.
 # If a task read from a blocking result partition, when its input is not 
available, we can ‘revoke’ the produce task, set the task fail and rerun the 
upstream task to regenerate data.  the revoke can propagate up through the 
chain. In Spark, revoke is naturally support by lineage.

To make fault tolerance easier, we need to keep deterministic behavior as much 
as possible. For user code, it’s not easy to control. However, for system 
related code, we can fix it. For example, we should at least make sure the 
different attempt of a same task to have the same inputs (we have a bug in 
current codebase (DataSourceTask) that cannot guarantee this). Note that this 
is track by [Flink-10205]

Details see this proposal:

[https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to