XComp opened a new pull request #15090:
URL: https://github.com/apache/flink/pull/15090


   ## What is the purpose of the change
   
   The test had two problems:
   1. The parallelism of the job exceeded the number of available slots, which 
caused a resource timeout for every job run.
   2. There's a known race condition between the `ResourceManager` cleaning up 
the requirements in the `DefaultDeclarativeSlotPool` while freeing the finished 
job's resources and the corresponding `TaskExecutor` freeing its tasks as part 
of the job cleanup.
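
The first problem comes down to simple slot arithmetic. The following is a minimal sketch, not code from this PR: it assumes all operators share one slot sharing group, so the job needs as many slots as its parallelism. The class name, method names, and numbers are hypothetical.

```java
// Hypothetical helper, not part of Flink: checks whether a job's parallelism
// fits into the slots offered by the cluster.
final class SlotCheckSketch {

    // Total slots = number of TaskManagers * slots per TaskManager.
    static int availableSlots(int taskManagers, int slotsPerTaskManager) {
        return taskManagers * slotsPerTaskManager;
    }

    // Assumption: one slot sharing group, so the job needs `parallelism` slots.
    static boolean fitsIntoCluster(int parallelism, int taskManagers, int slotsPerTaskManager) {
        return parallelism <= availableSlots(taskManagers, slotsPerTaskManager);
    }

    public static void main(String[] args) {
        // Illustrative numbers only; they are not taken from the test.
        System.out.println(fitsIntoCluster(4, 1, 2)); // parallelism exceeds the slots
        System.out.println(fitsIntoCluster(2, 1, 2)); // parallelism fits
    }
}
```

When the check is false, the job has to wait for resources that never arrive, which matches the resource timeout described above.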
   
   ## Brief change log
   
   * The job's parallelism was lowered.
   * A new parameter `taskmanager.slot.timeout` is introduced. It makes the 
time after which a slot is considered inactive configurable independently of 
the RPC timeout, which was used before.
   * The new parameter is used in `AdaptiveSchedulerSlotSharingITCase` to break 
the race condition between the two cleanup mechanisms. The slot is now freed 
faster if it was accidentally allocated again for the finished job because the 
`TaskExecutor` cleaned up faster than the `ResourceManager`.
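
To illustrate how a dedicated slot timeout can be decoupled from the RPC timeout, here is a minimal sketch in plain Java. It is not the Flink implementation: only the key `taskmanager.slot.timeout` comes from this PR. The fallback to an RPC timeout, the placeholder key `rpc.timeout.ms`, the default value, and the millisecond-only value format are assumptions made for illustration.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical sketch, not Flink code: resolves the slot timeout from a
// key/value configuration.
final class SlotTimeoutSketch {

    // Key introduced by this PR.
    static final String SLOT_TIMEOUT_KEY = "taskmanager.slot.timeout";
    // Placeholder key standing in for the RPC timeout (assumption).
    static final String RPC_TIMEOUT_KEY = "rpc.timeout.ms";
    // Hypothetical default used when neither key is set.
    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(10);

    // Values are plain milliseconds here to keep the sketch self-contained.
    static Duration resolveSlotTimeout(Map<String, String> conf) {
        String value = conf.get(SLOT_TIMEOUT_KEY);
        if (value == null) {
            // Assumed behaviour: fall back to the RPC timeout used before.
            value = conf.get(RPC_TIMEOUT_KEY);
        }
        return value == null ? DEFAULT_TIMEOUT : Duration.ofMillis(Long.parseLong(value));
    }

    public static void main(String[] args) {
        // A test can shorten the slot timeout without touching the RPC timeout.
        Map<String, String> conf = Map.of(SLOT_TIMEOUT_KEY, "500", RPC_TIMEOUT_KEY, "10000");
        System.out.println(resolveSlotTimeout(conf));
    }
}
```

With a separate key, a test such as `AdaptiveSchedulerSlotSharingITCase` can use a short slot timeout so that an accidentally re-allocated slot is released quickly, while RPC calls keep their longer timeout.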
   
   ## Verifying this change
   
   We ran the test in a loop; before the change, it failed consistently. The 
[AzureCI 
run](https://dev.azure.com/mapohl/flink/_build/results?buildId=307&view=results)
 of that loop now fails only because no error is caught anymore.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? docs
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

