rmetzger commented on code in PR #978:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/978#discussion_r2076080196


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java:
##########
@@ -299,9 +303,92 @@ public boolean 
reconcileOtherChanges(FlinkResourceContext<FlinkDeployment> ctx)
             return true;
         }
 
+        // check for JobManager exceptions if the REST API server is still up.
+        if (!ReconciliationUtils.isJobInTerminalState(deployment.getStatus())) 
{
+            observeJobManagerExceptions(ctx, deployment, observeConfig);
+        }
+
         return cleanupTerminalJmAfterTtl(ctx.getFlinkService(), deployment, 
observeConfig);
     }
 
+    private void observeJobManagerExceptions(
+            FlinkResourceContext<FlinkDeployment> ctx,
+            FlinkDeployment deployment,
+            Configuration observeConfig) {
+        try {
+            var jobId = 
JobID.fromHexString(deployment.getStatus().getJobStatus().getJobId());
+            var history = ctx.getFlinkService().getJobExceptions(deployment, 
jobId, observeConfig);
+            if (history == null || history.getExceptionHistory() == null) {
+                return;
+            }
+            var exceptionHistory = history.getExceptionHistory();
+            var exceptions = exceptionHistory.getEntries();
+            if (exceptions.isEmpty()) {
+                LOG.info(String.format("No exceptions found in job exception 
history for jobId '%s'.", jobId));

Review Comment:
   This log message might be too verbose, as ideally Flink jobs do not have 
exceptions under normal circumstances



##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java:
##########
@@ -299,9 +303,92 @@ public boolean 
reconcileOtherChanges(FlinkResourceContext<FlinkDeployment> ctx)
             return true;
         }
 
+        // check for JobManager exceptions if the REST API server is still up.
+        if (!ReconciliationUtils.isJobInTerminalState(deployment.getStatus())) 
{
+            observeJobManagerExceptions(ctx, deployment, observeConfig);
+        }
+
         return cleanupTerminalJmAfterTtl(ctx.getFlinkService(), deployment, 
observeConfig);
     }
 
+    private void observeJobManagerExceptions(
+            FlinkResourceContext<FlinkDeployment> ctx,
+            FlinkDeployment deployment,
+            Configuration observeConfig) {
+        try {
+            var jobId = 
JobID.fromHexString(deployment.getStatus().getJobStatus().getJobId());
+            var history = ctx.getFlinkService().getJobExceptions(deployment, 
jobId, observeConfig);
+            if (history == null || history.getExceptionHistory() == null) {
+                return;
+            }
+            var exceptionHistory = history.getExceptionHistory();
+            var exceptions = exceptionHistory.getEntries();
+            if (exceptions.isEmpty()) {
+                LOG.info(String.format("No exceptions found in job exception 
history for jobId '%s'.", jobId));
+                return;
+            }
+            if (exceptionHistory.isTruncated()) {
+                LOG.warn(String.format("Job exception history is truncated for 
jobId '%s'. "
+                        + "Some exceptions are not shown.", jobId));
+            }
+            for (var exception : exceptions) {
+                emitJobManagerExceptionEvent(ctx, deployment, exception);
+            }
+        } catch (Exception e) {
+            LOG.warn("Could not fetch JobManager exception info.", e);
+        }
+    }
+
+    private void emitJobManagerExceptionEvent(
+            FlinkResourceContext<FlinkDeployment> ctx,
+            FlinkDeployment deployment,
+            JobExceptionsInfoWithHistory.RootExceptionInfo exception) {
+
+        String message = exception.getExceptionName();
+        if (message == null || message.isBlank()) {
+            return;
+        }
+
+        String stacktrace = exception.getStacktrace();
+        String taskName = exception.getTaskName();
+        String endpoint = exception.getEndpoint();
+        String tmId = exception.getTaskManagerId();
+        Map<String, String> labels = exception.getFailureLabels();
+        String time = 
DateTimeUtils.readable(Instant.ofEpochMilli(exception.getTimestamp()), 
ZoneId.systemDefault());
+
+        StringBuilder combined = new StringBuilder();
+        combined.append("JobManager Exception at ").append(time).append(":\n");
+        combined.append(message).append("\n\n");
+
+        if (taskName != null) {
+            combined.append("Task: ").append(taskName).append("\n");
+        }
+        if (endpoint != null) {
+            combined.append("Endpoint: ").append(endpoint).append("\n");
+        }
+        if (tmId != null) {
+            combined.append("TaskManager ID: ").append(tmId).append("\n");
+        }
+

Review Comment:
   Instead of putting metadata like task, time etc. into the message itself, is 
it possible to put this maybe into `metadata.annotations`?
   Can we set `eventTime` to the timestamp of the exception reported by the 
JobManager?
   For other reviewers, this is what the event looks like.
   
   ```
   - apiVersion: v1
     count: 1
     eventTime: null
     firstTimestamp: "2025-05-06T20:08:07Z"
     involvedObject:
       apiVersion: flink.apache.org/v1beta1
       kind: FlinkDeployment
       name: fraud-detection
       namespace: default
       uid: 769fe13a-65d3-42d2-a654-d452eff9c973
     kind: Event
     lastTimestamp: "2025-05-06T20:08:07Z"
     message: "JobManager Exception at 2025-05-06 
20:08:04:\norg.apache.flink.util.FlinkExpectedException\n\nTask:
       Flat Map -> Sink: Print to Std. Out (5/6) - execution #0\nEndpoint: 
10.244.1.203:43687\nTaskManager
       ID: 
fraud-detection-taskmanager-1-2\n\nStacktrace:\norg.apache.flink.util.FlinkExpectedException:
       The TaskExecutor is shutting down.\n\tat 
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:504)\n\tat
       
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:239)\n\tat
       
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.lambda$terminate$0(PekkoRpcActor.java:574)\n\tat
       
org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)\n\tat
       
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.terminate(PekkoRpcActor.java:573)\n\tat
       
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleControlMessage(PekkoRpcActor.java:196)\n\tat
       
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)\n\tat
       
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)\n\tat 
   [shortened]
     metadata:
       creationTimestamp: "2025-05-06T20:08:07Z"
       name: JobManagerDeployment.1781076436
       namespace: default
       resourceVersion: "1634805"
       uid: a0be5684-1ed1-4f9a-a07b-3a61220063a7
     reason: JobManagerException
     reportingComponent: ""
     reportingInstance: ""
     source:
       component: JobManagerDeployment
     type: Warning
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to