teamconfx opened a new pull request, #27463: URL: https://github.com/apache/flink/pull/27463
This PR fixes [FLINK-38870](https://issues.apache.org/jira/browse/FLINK-38870).

### Problem

When a JobManager loses leadership, jobs enter SUSPENDED state. The old error message "Job completed with illegal status: null" was uninformative and confusing.

### Solution

My approach improves on the JIRA proposal by:

1. Preserving the actual SUSPENDED JobStatus instead of losing it (the proposal only added a field but kept setting it to null)
2. Adding serialization support to preserve SUSPENDED state across REST API calls
3. Maintaining backward compatibility with older clients

### Files Modified

1. JobResult.java
   - Constructor validation: Changed from requiring "globally terminal" to just "terminal" states, allowing SUSPENDED
   - createFrom(): Now stores actual JobStatus including SUSPENDED (was setting null for non-globally-terminal states)
   - toJobExecutionResult(): Added specific handling for SUSPENDED with detailed error message: "Job is in state SUSPENDED. This commonly happens when the JobManager lost leadership. The job may recover automatically if High Availability and a persistent job store are configured. If recovery is not possible (e.g., non-persistent ExecutionPlanStore), the job needs to be resubmitted."
2. JobResultSerializer.java
   - Added new job-status field to preserve actual JobStatus in JSON (alongside existing application-status for backward compatibility)
3. JobResultDeserializer.java
   - Reads new job-status field if present (takes priority)
   - Falls back to application-status for backward compatibility with older messages
4. Tests Added
   - JobResultTest: 3 new tests for SUSPENDED state handling
   - JobResultDeserializerTest: 3 new tests for serialization with SUSPENDED state

### Test Results

```
Tests run: 17, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS
```

### Error Message Comparison

Before:

```
Job completed with illegal status: null.
```

After:

```
Job is in state SUSPENDED. This commonly happens when the JobManager lost leadership.
The job may recover automatically if High Availability and a persistent job store are configured. If recovery is not possible (e.g., non-persistent ExecutionPlanStore), the job needs to be resubmitted.
```
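
For reviewers, the field precedence described for JobResultDeserializer.java (job-status first, application-status as fallback) can be sketched roughly as follows. This is a self-contained illustration, not the code in this PR: the `Status` enum, the `resolveStatus` helper, and the use of a plain `Map` in place of parsed JSON are all hypothetical stand-ins. It also assumes that a new-format message for a suspended job carries `application-status` of `UNKNOWN`, which the description above does not state.

```java
import java.util.Map;
import java.util.Optional;

public class JobStatusFallbackSketch {

    // Simplified stand-in for the job status enum; only the states needed here.
    enum Status { FINISHED, FAILED, CANCELED, SUSPENDED }

    // Hypothetical helper: picks the status from already-parsed JSON fields.
    // "job-status" wins when present; otherwise "application-status" is mapped.
    static Optional<Status> resolveStatus(Map<String, String> fields) {
        String jobStatus = fields.get("job-status");
        if (jobStatus != null) {
            // New-format message: the actual status (including SUSPENDED) is preserved.
            return Optional.of(Status.valueOf(jobStatus));
        }
        String applicationStatus = fields.get("application-status");
        if (applicationStatus == null) {
            return Optional.empty();
        }
        // Old-format message: only the coarse application status is available.
        switch (applicationStatus) {
            case "SUCCEEDED":
                return Optional.of(Status.FINISHED);
            case "FAILED":
                return Optional.of(Status.FAILED);
            case "CANCELED":
                return Optional.of(Status.CANCELED);
            default:
                // Assumed: other values (e.g. UNKNOWN) carry no usable job status.
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // New-format message (assumed shape): job-status takes priority.
        System.out.println(resolveStatus(
                Map.of("job-status", "SUSPENDED", "application-status", "UNKNOWN")));
        // prints "Optional[SUSPENDED]"

        // Old-format message: falls back to application-status.
        System.out.println(resolveStatus(Map.of("application-status", "SUCCEEDED")));
        // prints "Optional[FINISHED]"
    }
}
```

In this sketch an old-format message can never yield SUSPENDED, which is consistent with the motivation for adding the separate job-status field.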
