StephanEwen opened a new pull request #15557:
URL: https://github.com/apache/flink/pull/15557


   ## Purpose of this PR
   
   This is the first part of the fix for 
[FLINK-21996](https://issues.apache.org/jira/browse/FLINK-21996).
   
   ### (1) Refactoring of Number Sequence Source
   
   The initial commits refactor the Number Sequence Source to increase the 
flexibility.
   The source was previously very inflexible any could only deal with one split 
per parallel reader.
   To reduce future maintenance for the tests, I wanted to reuse this existing 
source and thus needed to increase the flexibility a bit (which is also good 
for that source in any case).
   
   ### (2) Test for RPC message loss
   
   The big commit 6360d65 introduces an Integration Test Case that simulates 
lost RPC messages and lost RPC acks for the events that assign source splits. 
As expected, this currently violates exactly-once semantics.
   
   The main test is in `OperatorEventSendingCheckpointITCase` and uses some 
tricks to inject an RPC gateway decorator to the MiniCluster, which can then be 
used to set various filters for the RPC calls. Some test methods are annotated 
with `@Ignore`, because they would detect the current shortcoming and fail. The 
`@Ignore` will be removed once the follow-up fix is in.
   
   ### (3) Operator Coordinator Event sending threading model
   
   Previously, events sent from the `OperatorCoordinator` to the tasks went 
synchronously to the Job Vertex and the latest `Execution` and the TaskManager 
RPC Gateway. However, this leads to hard-to-handle situations with races 
between event sending and task failures and recoveries. In extreme cases, an 
the Coordinator's thread could trigger an event to be send, and before the 
event reached the `Execution` a failure and recovery happened, meaning the 
event went to a later `Execution` than intended.
   
   Commit eea2bac changes the threading model so that event to be sent from an 
OperatorCoordinator to a task are now passed into the Scheduler Main Thread and 
sent from there. Because we need to preserve the order of events and 
checkpoints, the checkpoint completion is also passed into the main thread now.
   
   This does not fully fix the above-mentioned possible race, but it simplifies 
parts to make the next fix easier.
   
   ## Verifying this change
   
     - This change is covered by existing tests and adds a new test.
     - To verify the `OperatorEventSendingCheckpointITCase`, remove the 
`@Ignore` annotation and see it fail: The program never completes because it 
waits for data that got lost with the lost RPC calls.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no**
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: **no**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? **not applicable**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to