metaswirl commented on a change in pull request #18689:
URL: https://github.com/apache/flink/pull/18689#discussion_r808850854
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -306,22 +327,87 @@ void deliverOperatorEventToCoordinator(
                 operatorId, request);
     }

+    /** Transition to different state when failure occurs. Stays in the same state by default. */
+    abstract void onFailure(Throwable cause);
+
+    <T extends StateTransitions.ToRestarting & StateTransitions.ToFailing> void restartOrFail(
+            FailureResult failureResult, T context) {
+        if (failureResult.canRestart()) {
+            getLogger().info("Restarting job.", failureResult.getFailureCause());
+            context.goToRestarting(
+                    getExecutionGraph(),
+                    getExecutionGraphHandler(),
+                    getOperatorCoordinatorHandler(),
+                    failureResult.getBackoffTime(),
+                    getFailures());
+        } else {
+            getLogger().info("Failing job.", failureResult.getFailureCause());
+            context.goToFailing(
+                    getExecutionGraph(),
+                    getExecutionGraphHandler(),
+                    getOperatorCoordinatorHandler(),
+                    failureResult.getFailureCause(),
+                    getFailures());
+        }
+    }
+
+    /**
+     * Transition to different state when the execution graph reaches a globally terminal state.
+     *
+     * @param globallyTerminalState globally terminal state which the execution graph reached
+     */
+    abstract void onGloballyTerminalState(JobStatus globallyTerminalState);
+
+    @Override
+    public void handleGlobalFailure(Throwable cause) {
+        failureCollection.add(ExceptionHistoryEntry.createGlobal(cause));
+        onFailure(cause);
+    }
+
     /**
      * Updates the execution graph with the given task execution state transition.
      *
      * @param taskExecutionStateTransition taskExecutionStateTransition to update the ExecutionGraph
      *     with
      * @return {@code true} if the update was successful; otherwise {@code false}
      */
-    abstract boolean updateTaskExecutionState(
-            TaskExecutionStateTransition taskExecutionStateTransition);
+    boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionStateTransition) {
+        // collect before updateState, as updateState may deregister the execution
+        final Optional<AccessExecution> maybeExecution =
+                executionGraph.findExecution(taskExecutionStateTransition.getID());
+        final Optional<String> maybeTaskName =
+                executionGraph.findVertexWithAttempt(taskExecutionStateTransition.getID());
+
+        final ExecutionState desiredState = taskExecutionStateTransition.getExecutionState();
+        boolean successfulUpdate = getExecutionGraph().updateState(taskExecutionStateTransition);
+        if (successfulUpdate && desiredState == ExecutionState.FAILED) {
+            final AccessExecution execution =
+                    maybeExecution.orElseThrow(NoSuchElementException::new);
+            final String taskName = maybeTaskName.orElseThrow(NoSuchElementException::new);
+            final ExecutionState currentState = execution.getState();
+            if (currentState == desiredState) {
+                failureCollection.add(ExceptionHistoryEntry.create(execution, taskName));
Review comment:
The `failureCollection` contains all exceptions thrown in the interval between the initial failure and the job restart or cancellation. Usually this should be no more than one exception. Local failures are bounded by the number of tasks (in fact, the number is currently limited to 1). I don't know whether or how the number of global failures is bounded, but all such failures would have to arrive within this short interval.
Given these restrictions, I don't think leaving this collection unbounded is really dangerous.
If you want to change it anyway, should we add a new config parameter for it? Where should we add it, also to the `WebOptions`?
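If we did decide to bound the collection, one option would be a small size-capped wrapper that evicts the oldest entry once a configured limit is reached. The following is only a minimal standalone sketch; `maxEntries` is a hypothetical limit here, not an existing Flink config option, and the real change would of course wire the limit through the configuration:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Sketch of a size-bounded failure collection: once {@code maxEntries} is
 * reached, the oldest entry is dropped before the new one is appended.
 */
class BoundedFailureCollection<E> {
    private final int maxEntries;
    private final Deque<E> entries = new ArrayDeque<>();

    BoundedFailureCollection(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    void add(E entry) {
        if (entries.size() == maxEntries) {
            entries.removeFirst(); // evict the oldest failure
        }
        entries.addLast(entry);
    }

    /** Returns a copy of the currently retained failures, oldest first. */
    List<E> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

Dropping the oldest entries keeps the most recent failures visible in the exception history; alternatively the collection could refuse new entries once full, which would instead preserve the root cause.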
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]