[ https://issues.apache.org/jira/browse/FLINK-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aihua Li closed FLINK-18115.
----------------------------
Resolution: Done
I mainly ran the stability test developed by Ali, which simulates abnormal
production conditions (such as network interruption, a full disk, the JM/AM
process being killed, a TM throwing an exception, etc.) to check whether a
Flink job can recover automatically. The test ran for 5 hours and simulated
multiple combinations of failures; the Flink job returned to normal and
checkpoints could still be created. The test passed.
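
For illustration only, the pass criterion described above (the job is running
again and checkpoints are still being completed after a fault) could be
expressed roughly as below. This is a minimal sketch, not the actual test
tool: the function names are hypothetical, and the dictionaries are assumed to
be shaped like the JSON returned by the Flink REST endpoints /jobs/<jobid>
(a "state" field) and /jobs/<jobid>/checkpoints (a "counts.completed" field).

```python
from typing import Any, Mapping


def completed_checkpoints(checkpoint_stats: Mapping[str, Any]) -> int:
    """Return the number of completed checkpoints from a stats payload.

    Assumes a payload shaped like the /jobs/<jobid>/checkpoints response,
    i.e. {"counts": {"completed": N, ...}, ...}. Missing fields count as 0.
    """
    return int(checkpoint_stats.get("counts", {}).get("completed", 0))


def job_recovered(
    job_after: Mapping[str, Any],
    checkpoints_before: Mapping[str, Any],
    checkpoints_after: Mapping[str, Any],
) -> bool:
    """Hypothetical recovery check for one injected fault.

    The job counts as recovered when it is RUNNING again and at least one
    new checkpoint has completed since the fault was injected.
    """
    running = job_after.get("state") == "RUNNING"
    progressed = (
        completed_checkpoints(checkpoints_after)
        > completed_checkpoints(checkpoints_before)
    )
    return running and progressed
```

A driver would take one snapshot before injecting a fault and another after
waiting for recovery, then report a failure if job_recovered(...) is false.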
> Manually test fault-tolerance stability on Flink 1.11
> -----------------------------------------------------
>
> Key: FLINK-18115
> URL: https://issues.apache.org/jira/browse/FLINK-18115
> Project: Flink
> Issue Type: Sub-task
> Components: API / Core, API / State Processor, Build System, Client / Job Submission
> Affects Versions: 1.11.0
> Reporter: Aihua Li
> Assignee: Aihua Li
> Priority: Blocker
> Labels: release-testing
> Fix For: 1.11.0
>
>
> It mainly checks that a Flink job can recover from various abnormal
> situations, including a full disk, network interruption, ZooKeeper being
> unreachable, RPC message timeout, etc.
> If the job can't be recovered, the test has failed.