1u0 opened a new pull request #9131: [FLINK-12858][checkpointing] 
Stop-with-savepoint, workaround: fail whole job when savepoint is declined by a 
task
URL: https://github.com/apache/flink/pull/9131
 
 
   ## What is the purpose of the change
   
   This pull request attempts to address a Flink job hanging when 
stop-with-savepoint fails because one of the job's tasks declines the 
savepoint. With this change, the job manager fails the whole execution graph 
in such cases (which may trigger a job restart).
   
   ## Brief change log
   
     - The `LegacyScheduler` is modified to track `CheckpointException`s in 
`stopWithSavepoint()` that originate from tasks and to fail the execution 
graph for such exceptions.
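   
   The idea behind this change can be illustrated with a minimal, 
self-contained sketch. It is not the actual `LegacyScheduler` code: 
`SketchCheckpointException`, `SketchExecutionGraph`, the `declinedByTask` flag 
and `failGlobal` used here are hypothetical stand-ins, assumed only for 
illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class StopWithSavepointSketch {

    /** Hypothetical stand-in for a checkpoint failure; not Flink's CheckpointException. */
    static class SketchCheckpointException extends Exception {
        final boolean declinedByTask;

        SketchCheckpointException(String message, boolean declinedByTask) {
            super(message);
            this.declinedByTask = declinedByTask;
        }
    }

    /** Hypothetical stand-in for the execution graph; it only records the failure cause. */
    static class SketchExecutionGraph {
        volatile Throwable failureCause;

        void failGlobal(Throwable cause) {
            failureCause = cause;
        }
    }

    /**
     * Observes the savepoint future. If it fails with an exception that
     * originates from a task, the whole (sketch) execution graph is failed.
     * Other failures and successful savepoints leave the graph untouched.
     */
    static CompletableFuture<String> stopWithSavepoint(
            CompletableFuture<String> savepointFuture, SketchExecutionGraph graph) {
        return savepointFuture.whenComplete((path, error) -> {
            Throwable cause = unwrap(error);
            if (cause instanceof SketchCheckpointException
                    && ((SketchCheckpointException) cause).declinedByTask) {
                graph.failGlobal(cause);
            }
        });
    }

    /** Strips CompletionException wrappers added by chained futures. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }
}
```

   In this sketch, completing the savepoint future exceptionally with a 
task-originated exception marks the graph as failed, while the returned future 
still reports the savepoint failure to the caller.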
   
   ## Verifying this change
   
   This change was partially tested with a manual e2e test:
    * a Flink cluster configured with savepoints/checkpoints and 
`task.checkpoint.alignment.max-size: 64` set;
    * a test Flink job with two congested sources (joined in `map-1`) and 
events that exceed the configured limit (see the execution graph below).
   
   In the test job, `map-1` rejects checkpoints/savepoints with high 
probability due to `checkpointSizeLimitExceeded`.
   
   
![job-graph](https://user-images.githubusercontent.com/488251/61300624-9b368600-a7e2-11e9-8d9c-62e936689a79.png)
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)
   

