[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529796#comment-16529796 ]
ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

GitHub user GJL opened a pull request:

    https://github.com/apache/flink/pull/6239

    [FLINK-9004][tests] Implement Jepsen tests to test job availability.

    ## What is the purpose of the change

    *Use the Jepsen framework (https://github.com/jepsen-io/jepsen) to implement tests that verify Flink's HA capabilities under real-world faults, such as sudden TaskManager/JobManager termination, HDFS NameNode unavailability, network partitions, etc. The Flink cluster under test is automatically deployed on YARN (session & job mode) and Mesos. Provide Dockerfiles for local test development.*

    ## Brief change log

    - *Implement Jepsen tests.*

    ## Verifying this change

    This change added tests and can be verified as follows:

    - *The changes themselves are tests.*
    - *Run Jepsen tests in docker containers.*
    - *Run unit tests with `lein test`*

    ## Does this pull request potentially affect one of the following parts:

    - Dependencies (does it add or upgrade a dependency): (yes / **no** (at least not to Flink))
    - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
    - The serializers: (yes / **no** / don't know)
    - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
    - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** (but it will as soon as test failures appear) / don't know)
    - The S3 file system connector: (yes / **no** / don't know)

    ## Documentation

    - Does this pull request introduce a new feature? (yes / **no**)
    - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)

    cc: @tillrohrmann @cewood

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GJL/flink FLINK-9004

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6239.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6239

----
commit 063e4621a5982b55ee7f7b0935290bbc717a5a45
Author: gyao <gary@...>
Date:   2018-03-05T21:23:33Z

    [FLINK-9004][tests] Implement Jepsen tests to test job availability.

    Use the Jepsen framework (https://github.com/jepsen-io/jepsen) to implement
    tests that verify Flink's HA capabilities under real-world faults, such as
    sudden TaskManager/JobManager termination, HDFS NameNode unavailability,
    network partitions, etc. The Flink cluster under test is automatically
    deployed on YARN (session & job mode) and Mesos.

    Provide Dockerfiles for local test development.

----

> Cluster test: Run general purpose job with failures with Yarn session
> ---------------------------------------------------------------------
>
>                 Key: FLINK-9004
>                 URL: https://issues.apache.org/jira/browse/FLINK-9004
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Tests
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.6.0, 1.5.1
>
>
> Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on
> a Yarn session cluster and simulate failures.
> The job jar should be ill-packaged, meaning that we include too many
> dependencies in the user jar. We should include the Scala library, Hadoop and
> Flink itself to verify that there are no class loading issues.
> The general purpose job should run with misbehavior activated. Additionally,
> we should simulate at least the following failure scenarios:
> * Kill Flink processes
> * Kill connection to storage system for checkpoints and jobs
> * Simulate network partition
> We should run the test at least with the following state backend: RocksDB
> incremental async and checkpointing to S3.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)