XComp commented on a change in pull request #18910:
URL: https://github.com/apache/flink/pull/18910#discussion_r814754026
##########
File path: docs/content/docs/internals/job_scheduling.md
##########
@@ -95,3 +95,22 @@ For that reason, the execution of an ExecutionVertex is tracked in an {{< gh_lin
{{< img src="/fig/state_machine.svg" alt="States and Transitions of Task Executions" width="50%" >}}
{{< top >}}
+
+## Repeatable Resource Cleanup Strategy
Review comment:
I guess we concluded (when talking with @knaufk) that we would move this part into `Deployment / Overview` before `Vendor Solutions`. But looking into the docs again, I feel like we could put it right before the `Deployment Modes` section and connect it semantically with the `External Components` table entries (because cleanup means cleaning up all external components). WDYT?
##########
File path: docs/content/docs/internals/job_scheduling.md
##########
@@ -95,3 +95,22 @@ For that reason, the execution of an ExecutionVertex is tracked in an {{< gh_lin
{{< img src="/fig/state_machine.svg" alt="States and Transitions of Task Executions" width="50%" >}}
{{< top >}}
+
+## Repeatable Resource Cleanup Strategy
+
+Once a job has reached a terminally global state of either finished, failed or cancelled, the
+resources associated with the job are then cleaned up. This is done via a repeatable
+resource cleanup strategy, in which failures to clean up a resource result retries separated by
+an exponentially increasing delay, within configured bounds.
+
+{{< img src="/fig/repeatable_cleanup.png" alt="Repeatable resource cleanup across a failover event" >}}
+
+Determining which jobs are globally terminated but still need to be cleaned up across
+a failover event is done by determining whether an entry for the job exists in the JobResultStore
+and whether that entry is dirty (and thus needs resource cleanup) or clean (and thus does not
+need any further cleanup).
+
+The repeatable resource cleanup strategy has sensible defaults for the minimum and maximum
Review comment:
Good :-) This just changed a bit due to PR #18913
##########
File path: docs/content/docs/internals/job_scheduling.md
##########
@@ -95,3 +95,22 @@ For that reason, the execution of an ExecutionVertex is tracked in an {{< gh_lin
{{< img src="/fig/state_machine.svg" alt="States and Transitions of Task Executions" width="50%" >}}
{{< top >}}
+
+## Repeatable Resource Cleanup Strategy
+
+Once a job has reached a terminally global state of either finished, failed or cancelled, the
Review comment:
```suggestion
Once a job has reached a globally terminal state of either finished, failed or cancelled, the
```
"globally terminal" is the usual phrasing as far as I remember
##########
File path: docs/content/docs/internals/job_scheduling.md
##########
@@ -95,3 +95,22 @@ For that reason, the execution of an ExecutionVertex is tracked in an {{< gh_lin
{{< img src="/fig/state_machine.svg" alt="States and Transitions of Task Executions" width="50%" >}}
{{< top >}}
+
+## Repeatable Resource Cleanup Strategy
+
+Once a job has reached a terminally global state of either finished, failed or cancelled, the
+resources associated with the job are then cleaned up. This is done via a repeatable
+resource cleanup strategy, in which failures to clean up a resource result retries separated by
+an exponentially increasing delay, within configured bounds.
Review comment:
```
This is done via a repeatable
resource cleanup strategy, in which failures to clean up a resource result retries separated by
an exponentially increasing delay, within configured bounds.
```
I struggle to understand that sentence. 🤔 What about: "In case of cleanup failure, you are able to configure different ways of retrying the cleanup." (addressing the user directly seems to be desired in the documentation, according to @infoverload)
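For illustration, the retry behaviour the quoted paragraph tries to describe (an exponentially increasing delay between cleanup attempts, kept within configured bounds) can be sketched roughly as follows. This is an illustrative model only, not Flink's actual implementation; the function and parameter names are hypothetical.

```python
import time


def cleanup_with_retries(cleanup, min_delay=1.0, max_delay=60.0,
                         max_attempts=10, sleep=time.sleep):
    """Retry `cleanup` until it succeeds; the delay between attempts
    doubles from `min_delay` up to the `max_delay` cap."""
    delay = min_delay
    for attempt in range(max_attempts):
        try:
            cleanup()
            return True  # resource cleaned up successfully
        except Exception:
            if attempt == max_attempts - 1:
                return False  # give up after the configured number of attempts
            sleep(delay)
            # exponentially increasing delay, bounded by max_delay
            delay = min(delay * 2, max_delay)
    return False
```

Passing a custom `sleep` callable makes the backoff schedule observable in tests without actually waiting.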
##########
File path: docs/content/docs/internals/job_scheduling.md
##########
@@ -95,3 +95,22 @@ For that reason, the execution of an ExecutionVertex is tracked in an {{< gh_lin
{{< img src="/fig/state_machine.svg" alt="States and Transitions of Task Executions" width="50%" >}}
{{< top >}}
+
+## Repeatable Resource Cleanup Strategy
+
+Once a job has reached a terminally global state of either finished, failed or cancelled, the
+resources associated with the job are then cleaned up. This is done via a repeatable
+resource cleanup strategy, in which failures to clean up a resource result retries separated by
+an exponentially increasing delay, within configured bounds.
+
+{{< img src="/fig/repeatable_cleanup.png" alt="Repeatable resource cleanup across a failover event" >}}
Review comment:
I like that you integrated the picture. :-) I'm just wondering whether that's too detailed for the documentation. We're not talking about the individual components here (or in `Deployments / Overview`). Hence, I was wondering whether we should simplify it and hide the details of which components are handled internally. That would also have the benefit that, if we change which components are subject to cleanup, we wouldn't need to update this picture. 🤔
##########
File path: docs/content/docs/deployment/ha/overview.md
##########
@@ -79,3 +79,17 @@ The HA data will be kept until the respective job either succeeds, is cancelled
Once this happens, all the HA data, including the metadata stored in the HA services, will be deleted.
{{< top >}}
+
+## JobResultStore
+
+In order to preserve a job's scheduling status across failover events and prevent erroneous
+re-execution of globally terminated (finished, cancelled or failed) jobs, Flink persists
Review comment:
```suggestion
re-execution of globally terminated (i.e. finished, cancelled or failed) jobs, Flink persists
```
nit
##########
File path: docs/content/docs/internals/job_scheduling.md
##########
@@ -95,3 +95,22 @@ For that reason, the execution of an ExecutionVertex is tracked in an {{< gh_lin
{{< img src="/fig/state_machine.svg" alt="States and Transitions of Task Executions" width="50%" >}}
{{< top >}}
+
+## Repeatable Resource Cleanup Strategy
+
+Once a job has reached a terminally global state of either finished, failed or cancelled, the
+resources associated with the job are then cleaned up. This is done via a repeatable
+resource cleanup strategy, in which failures to clean up a resource result retries separated by
+an exponentially increasing delay, within configured bounds.
+
+{{< img src="/fig/repeatable_cleanup.png" alt="Repeatable resource cleanup across a failover event" >}}
+
+Determining which jobs are globally terminated but still need to be cleaned up across
+a failover event is done by determining whether an entry for the job exists in the JobResultStore
+and whether that entry is dirty (and thus needs resource cleanup) or clean (and thus does not
+need any further cleanup).
Review comment:
Hm, initially I thought it was a good idea and that we should cross-reference the JobResultStore docs here. But then again, I was wondering whether we should, because the `JobResultStore` is mostly relevant in HA mode. The retries, however, would also work in non-HA mode and might confuse users in that case. Hence, I was wondering whether we should treat this paragraph as not relevant for the user.
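For illustration, the dirty/clean bookkeeping described in the quoted paragraph could be modelled roughly as follows. This is a hypothetical sketch of the concept, not Flink's actual `JobResultStore` API; all method names are invented.

```python
# Conceptual model: an entry exists once a job is globally terminated.
# A "dirty" entry means resource cleanup is still pending; a "clean"
# entry means cleanup completed and nothing further needs to be done.
class JobResultStoreModel:
    def __init__(self):
        self._entries = {}  # job_id -> "dirty" | "clean"

    def create_dirty_entry(self, job_id):
        # Written when the job reaches a globally terminal state.
        self._entries[job_id] = "dirty"

    def mark_clean(self, job_id):
        # Written once all resources of the job have been cleaned up.
        self._entries[job_id] = "clean"

    def is_globally_terminated(self, job_id):
        # An entry of either kind means the job must not be re-executed.
        return job_id in self._entries

    def needs_cleanup(self, job_id):
        # After a failover, only jobs with a dirty entry are cleaned up.
        return self._entries.get(job_id) == "dirty"
```

Under this model, recovery after a failover would re-trigger cleanup for every job whose entry is still dirty, while clean entries (and absent entries) are left alone.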
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]