XComp commented on a change in pull request #18910:
URL: https://github.com/apache/flink/pull/18910#discussion_r815822695
##########
File path: docs/content/docs/internals/job_scheduling.md
##########
@@ -95,3 +95,22 @@ For that reason, the execution of an ExecutionVertex is
tracked in an {{< gh_lin
{{< img src="/fig/state_machine.svg" alt="States and Transitions of Task Executions" width="50%" >}}
{{< top >}}
+
+## Repeatable Resource Cleanup Strategy
+
+Once a job has reached a globally terminal state of either finished, failed or cancelled, the
+resources associated with the job are then cleaned up. This is done via a repeatable
+resource cleanup strategy, in which failures to clean up a resource result in retries separated by
+an exponentially increasing delay, within configured bounds.
+
+{{< img src="/fig/repeatable_cleanup.png" alt="Repeatable resource cleanup across a failover event" >}}
+
+To determine which jobs are globally terminated but still need to be cleaned up across
+a failover event, Flink checks whether an entry for the job exists in the JobResultStore
+and whether that entry is dirty (and thus needs resource cleanup) or clean (and thus does not
+need any further cleanup).
+
+The repeatable resource cleanup strategy has sensible defaults for the minimum and maximum
Review comment:
After PR #18913 has been approved, you can use the information from the
branch to update the documentation, I guess.
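
For readers of the hunk above, the retry behavior it describes (retries separated by an exponentially increasing delay, within configured bounds) can be sketched roughly as follows. This is only an illustration, not Flink's actual implementation: the class name `CleanupBackoffSketch`, the method `delayForAttempt`, the doubling factor, and the parameter names `minDelay` and `maxDelay` are all assumptions made for the example. The real option names and defaults should be taken from the branch mentioned above.

```java
import java.time.Duration;

/** Illustrative sketch only; names and the doubling factor are assumptions, not Flink API. */
final class CleanupBackoffSketch {

    private CleanupBackoffSketch() {}

    /**
     * Returns the delay to wait before the given retry attempt (0-based).
     * The delay starts at minDelay, doubles on every further attempt,
     * and never exceeds maxDelay.
     */
    static Duration delayForAttempt(int attempt, Duration minDelay, Duration maxDelay) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be non-negative");
        }
        if (minDelay.isNegative() || minDelay.compareTo(maxDelay) > 0) {
            throw new IllegalArgumentException("require 0 <= minDelay <= maxDelay");
        }
        long delayMs = minDelay.toMillis();
        long maxMs = maxDelay.toMillis();
        // Stop early once the upper bound is reached; later attempts stay at maxDelay.
        for (int i = 0; i < attempt && delayMs < maxMs; i++) {
            // Comparing against maxMs / 2 avoids overflow when doubling.
            delayMs = delayMs > maxMs / 2 ? maxMs : delayMs * 2;
        }
        return Duration.ofMillis(delayMs);
    }
}
```

With a hypothetical lower bound of 1 second and upper bound of 8 seconds, successive failed cleanup attempts would wait 1 s, 2 s, 4 s, 8 s, and then stay at 8 s.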