dmvk commented on a change in pull request #18689:
URL: https://github.com/apache/flink/pull/18689#discussion_r803659647
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -148,6 +162,33 @@ public Logger getLogger() {
return logger;
}
+ protected Throwable extractError(TaskExecutionStateTransition taskExecutionStateTransition) {
+ Throwable cause = taskExecutionStateTransition.getError(userCodeClassLoader);
+ if (cause == null) {
+ cause = new FlinkException("Unknown failure cause. Probably related to FLINK-21376.");
+ }
+ return cause;
+ }
+
+ protected Optional<ExecutionVertexID> extractExecutionVertexID(
+ TaskExecutionStateTransition taskExecutionStateTransition) {
+ return executionGraph.getExecutionVertexId(taskExecutionStateTransition.getID());
Review comment:
I'd like to avoid adding new methods to the execution graph.
`ExecutionGraph#getRegisteredExecutions` should be enough to cover this use
case.
Another thing: we never really expect the entry to be missing, so we can
throw an exception right away if we don't find one.
DefaultScheduler does the same thing. Even though it also uses an Optional,
it checks that the Optional is not empty after a successful update to the
execution graph.
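A minimal sketch of the fail-fast lookup suggested above, written against plain stand-in types rather than Flink's classes. `VertexLookupSketch`, the `String` ids and the nested `Execution` class are hypothetical; only the idea of reading from the registered-executions map and throwing on a missing entry comes from the comment.

```java
import java.util.Map;

class VertexLookupSketch {
    /** Hypothetical stand-in for an execution that knows its vertex id. */
    static final class Execution {
        private final String vertexId;

        Execution(String vertexId) {
            this.vertexId = vertexId;
        }

        String getVertexId() {
            return vertexId;
        }
    }

    /**
     * Looks up the vertex id for an attempt in the map of registered executions
     * and throws right away if there is no entry, instead of returning an Optional.
     */
    static String findVertexIdOrThrow(
            Map<String, Execution> registeredExecutions, String attemptId) {
        Execution execution = registeredExecutions.get(attemptId);
        if (execution == null) {
            throw new IllegalStateException(
                    "No registered execution for attempt " + attemptId);
        }
        return execution.getVertexId();
    }
}
```

With this shape the caller no longer has to unwrap an Optional; a missing entry surfaces immediately as an exception.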
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java
##########
@@ -80,49 +89,53 @@ public void cancel() {
getExecutionGraph(), getExecutionGraphHandler(),
getOperatorCoordinatorHandler());
}
+ private void handleFailure(Failure failure) {
+ failureCollection.add(failure);
+ FailureResult failureResult = context.howToHandleFailure(failure);
+ transitionOnFailure(failureResult);
+ }
+
@Override
public void handleGlobalFailure(Throwable cause) {
- handleAnyFailure(cause);
+ handleFailure(Failure.createGlobal(cause));
}
- private void handleAnyFailure(Throwable cause) {
- final FailureResult failureResult = context.howToHandleFailure(cause);
+ @Override
+ boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionStateTransition) {
+ final boolean successfulUpdate =
+ getExecutionGraph().updateState(taskExecutionStateTransition);
+
+ if (successfulUpdate
+ && taskExecutionStateTransition.getExecutionState() == ExecutionState.FAILED) {
+ handleFailure(
+ Failure.createLocal(
+ extractError(taskExecutionStateTransition),
+ extractExecutionVertexID(taskExecutionStateTransition)));
+ }
+
+ return successfulUpdate;
Review comment:
This method is duplicated several times, basically in all states
extending `StateWithExecutionGraph` apart from `Cancelling`.
1) It feels like it could be moved up to the base class to avoid code
duplication.
2) I think tasks can also fail while the job is cancelling (basically when
the cancel call on the operator throws an exception). Is this correct? If yes,
it would eliminate the need to treat the `Cancelling` state differently.
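One way the unification in point 1 could look, sketched with hypothetical stand-in types (`BaseStateSketch`, `ExecutingSketch`, `String` attempt ids); this is not Flink's `StateWithExecutionGraph`. The base class owns the update-and-check logic once, and each state only supplies its reaction to a failure.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the shared base state. */
abstract class BaseStateSketch {
    enum ExecutionState { RUNNING, FAILED }

    /** Stand-in for the execution graph update; returns false if the update is rejected. */
    protected abstract boolean updateGraph(String attemptId, ExecutionState newState);

    /** State-specific reaction to a task failure. */
    protected abstract void onFailure(Throwable cause);

    /** Shared implementation: update the graph, then report a failure if the task failed. */
    final boolean updateTaskExecutionState(
            String attemptId, ExecutionState newState, Throwable error) {
        final boolean successfulUpdate = updateGraph(attemptId, newState);
        if (successfulUpdate && newState == ExecutionState.FAILED) {
            // Fall back to a generic cause when none was reported.
            onFailure(error != null ? error : new RuntimeException("Unknown failure cause."));
        }
        return successfulUpdate;
    }
}

/** Hypothetical concrete state that only records the failures it sees. */
class ExecutingSketch extends BaseStateSketch {
    final List<Throwable> failures = new ArrayList<>();

    @Override
    protected boolean updateGraph(String attemptId, ExecutionState newState) {
        // Assumption for the sketch: an empty id models an unknown attempt.
        return !attemptId.isEmpty();
    }

    @Override
    protected void onFailure(Throwable cause) {
        failures.add(cause);
    }
}
```

If `Cancelling` can also see failed tasks, as point 2 suggests, it would simply provide its own `onFailure` and need no special casing.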
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -148,6 +162,33 @@ public Logger getLogger() {
return logger;
}
+ protected Throwable extractError(TaskExecutionStateTransition taskExecutionStateTransition) {
+ Throwable cause = taskExecutionStateTransition.getError(userCodeClassLoader);
+ if (cause == null) {
+ cause = new FlinkException("Unknown failure cause. Probably related to FLINK-21376.");
+ }
+ return cause;
+ }
+
+ protected Optional<ExecutionVertexID> extractExecutionVertexID(
+ TaskExecutionStateTransition taskExecutionStateTransition) {
+ return executionGraph.getExecutionVertexId(taskExecutionStateTransition.getID());
+ }
+
+ protected static Optional<RootExceptionHistoryEntry> convertFailures(
+ Function<ExecutionVertexID, Optional<ExecutionVertex>> lookup,
Review comment:
This has a weird signature. Why can't we simply pass an execution graph
here?
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -148,6 +162,33 @@ public Logger getLogger() {
return logger;
}
+ protected Throwable extractError(TaskExecutionStateTransition taskExecutionStateTransition) {
Review comment:
We can get rid of this method once we unify `updateTaskExecutionState`
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/failure/Failure.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.scheduler.adaptive.failure;
+
+import org.apache.flink.runtime.executiongraph.ExecutionVertex;
+import org.apache.flink.runtime.scheduler.exceptionhistory.ExceptionHistoryEntry;
+import org.apache.flink.runtime.scheduler.exceptionhistory.RootExceptionHistoryEntry;
+import org.apache.flink.runtime.scheduler.strategy.ExecutionVertexID;
+
+import java.util.Optional;
+import java.util.Set;
+import java.util.function.Function;
+
+/** Failure object. */
+public abstract class Failure {
+ private final Throwable cause;
+ private final long timestamp;
+
+ public Failure(Throwable cause) {
+ this.cause = cause;
+ this.timestamp = System.currentTimeMillis();
Review comment:
Not really, the DefaultScheduler does the same thing. Unless the
timestamp is reported by the TM, there is not much we can do.
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StopWithSavepoint.java
##########
@@ -134,30 +143,53 @@ public JobStatus getJobStatus() {
return JobStatus.RUNNING;
}
+ private void handleFailure(Failure failure) {
+ failureCollection.add(failure);
+ FailureResult failureResult = context.howToHandleFailure(failure);
+ transitionOnFailure(failureResult);
+ }
+
@Override
public void handleGlobalFailure(Throwable cause) {
- handleAnyFailure(cause);
+ handleFailure(Failure.createGlobal(cause));
}
@Override
boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionStateTransition) {
final boolean successfulUpdate =
getExecutionGraph().updateState(taskExecutionStateTransition);
- if (successfulUpdate) {
- if (taskExecutionStateTransition.getExecutionState() == ExecutionState.FAILED) {
- Throwable cause = taskExecutionStateTransition.getError(userCodeClassLoader);
- handleAnyFailure(
- cause == null
- ? new FlinkException(
- "Unknown failure cause. Probably related to FLINK-21376.")
- : cause);
- }
+ if (successfulUpdate
+ && taskExecutionStateTransition.getExecutionState() == ExecutionState.FAILED) {
+ handleFailure(
+ Failure.createLocal(
+ extractError(taskExecutionStateTransition),
+ extractExecutionVertexID(taskExecutionStateTransition)));
}
return successfulUpdate;
}
+ private void transitionOnFailure(FailureResult failureResult) {
Review comment:
This is duplicated in `Executing` state.
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -148,6 +162,33 @@ public Logger getLogger() {
return logger;
}
+ protected Throwable extractError(TaskExecutionStateTransition taskExecutionStateTransition) {
+ Throwable cause = taskExecutionStateTransition.getError(userCodeClassLoader);
+ if (cause == null) {
+ cause = new FlinkException("Unknown failure cause. Probably related to FLINK-21376.");
+ }
+ return cause;
+ }
+
+ protected Optional<ExecutionVertexID> extractExecutionVertexID(
+ TaskExecutionStateTransition taskExecutionStateTransition) {
+ return executionGraph.getExecutionVertexId(taskExecutionStateTransition.getID());
+ }
+
+ protected static Optional<RootExceptionHistoryEntry> convertFailures(
+ Function<ExecutionVertexID, Optional<ExecutionVertex>> lookup,
+ List<Failure> failureCollection) {
+ if (failureCollection.isEmpty()) {
+ return Optional.empty();
+ }
+ Failure first = failureCollection.remove(0);
+ Set<ExceptionHistoryEntry> entries = new HashSet<>();
+ for (Failure failure : failureCollection) {
+ entries.add(failure.toExceptionHistoryEntry(lookup));
Review comment:
Wouldn't simply implementing hashCode & equals for the Failure object do
the trick?
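A minimal sketch of what value-based equality on a failure object could look like, so that a `HashSet` collapses duplicates by itself. `FailureSketch` is a hypothetical stand-in, and the choice of fields (the same cause instance plus the same timestamp) is an assumption; the PR's `Failure` class may need a different notion of equality.

```java
import java.util.Objects;

/** Hypothetical stand-in for the PR's Failure class. */
final class FailureSketch {
    private final Throwable cause;
    private final long timestamp;

    FailureSketch(Throwable cause, long timestamp) {
        this.cause = cause;
        this.timestamp = timestamp;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof FailureSketch)) {
            return false;
        }
        FailureSketch that = (FailureSketch) o;
        // Assumption: two failures are equal when they wrap the same cause
        // instance and were recorded at the same time.
        return timestamp == that.timestamp && cause == that.cause;
    }

    @Override
    public int hashCode() {
        // Consistent with equals: identity of the cause plus the timestamp.
        return Objects.hash(System.identityHashCode(cause), timestamp);
    }
}
```

Two `FailureSketch` objects built from the same cause and timestamp then occupy a single slot in a `HashSet`, while a different timestamp yields a separate entry.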
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java
##########
@@ -1576,4 +1576,15 @@ public ExecutionDeploymentListener getExecutionDeploymentListener() {
public boolean isDynamic() {
return isDynamic;
}
+
+ @Override
+ public Optional<ExecutionVertexID> getExecutionVertexId(ExecutionAttemptID id) {
+ Execution execution = this.getRegisteredExecutions().get(id);
+ return Optional.ofNullable(execution).map(Execution::getVertex).map(ExecutionVertex::getID);
+ }
+
+ @Override
+ public Optional<ExecutionVertex> getExecutionVertex(final ExecutionVertexID executionVertexId) {
Review comment:
We already have `DefaultExecutionGraph#getExecutionVertexOrThrow` in
place, so we should avoid adding a new method