[GitHub] [flink] Thesharing commented on a diff in pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

GitBox Thu, 07 Apr 2022 08:11:58 -0700


Thesharing commented on code in PR #19275:
URL: https://github.com/apache/flink/pull/19275#discussion_r845254295



##########
flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java:
##########
@@ -1103,23 +1096,64 @@ private void archiveExecutionGraph(ExecutionGraphInfo 
executionGraphInfo) {
                     executionGraphInfo.getArchivedExecutionGraph().getJobID(),
                     e);
         }
+    }
 
-        // do not create an archive for suspended jobs, as this would 
eventually lead to multiple
-        // archive attempts which we currently do not support
-        if 
(executionGraphInfo.getArchivedExecutionGraph().getState().isGloballyTerminalState())
 {
-            final CompletableFuture<Acknowledge> executionGraphFuture =
-                    
historyServerArchivist.archiveExecutionGraph(executionGraphInfo);
-
-            executionGraphFuture.whenComplete(
-                    (Acknowledge ignored, Throwable throwable) -> {
-                        if (throwable != null) {
-                            log.info(
-                                    "Could not archive completed job {}({}) to 
the history server.",
-                                    
executionGraphInfo.getArchivedExecutionGraph().getJobName(),
-                                    
executionGraphInfo.getArchivedExecutionGraph().getJobID(),
-                                    throwable);
-                        }
-                    });
+    private CompletableFuture<Acknowledge> 
archiveExecutionGraphToHistoryServer(
+            ExecutionGraphInfo executionGraphInfo) {
+
+        final CompletableFuture<Acknowledge> executionGraphFuture =
+                FutureUtils.orTimeout(
+                        
historyServerArchivist.archiveExecutionGraph(executionGraphInfo),
+                        
configuration.get(ClusterOptions.CLUSTER_SERVICES_SHUTDOWN_TIMEOUT),
+                        TimeUnit.MILLISECONDS,
+                        getMainThreadExecutor());

Review Comment:
   In our production we met a scenario that the user specifies 
`historyserver.archive.fs.dir` to a remote OSS path. It takes a long time to 
archive the files to OSS. If a timeout happens, it means that the storage is 
not reliable. I'm worried that if no timeout is added, the job will never end 
in the worst scenario. I fully understand that if a user enables the configs 
related to the history server, he/she wants to have the archiving finished. 
However, if the worst case happened, he/she has to shutdown the job manually, 
with no resource clean up. In my opinion, resource clean up is more important 
than archiving, even when the user enables the history server. Thus, I think a 
timeout is necessary for the archiving.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] Thesharing commented on a diff in pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Reply via email to