Hi!
Thanks so much for your feedback and suggestions. I have updated the FLIP accordingly and truly appreciate your input!

Best,
Yi

At 2026-01-15 14:20:53, [email protected] wrote:
>Hi!
>
>I think a separate endpoint with the effective config makes perfect sense and
>will cover the requirements.
>
>Thank you for including :)
>
>Gyula
>
>Sent from my iPhone
>
>> On 15 Jan 2026, at 06:50, Yi Zhang <[email protected]> wrote:
>>
>> Hi Gyula,
>>
>> I have given this some more thought, and I agree with your
>> point: users (and external controllers like the Kubernetes
>> Operator) need access to critical context such as the state
>> recovery path, even in cases where no job has been
>> successfully submitted.
>>
>> What if we introduce a new REST API endpoint, such as
>> /applications/:applicationid/jobmanager/config, to expose the
>> effective JobManager configuration used (or intended to be used)
>> by the application? This could include key settings like the
>> state recovery path and other relevant configured options.
>>
>> It might help make the API responsibilities clearer, and also
>> provide valuable visibility even when no errors occur. For
>> actual error details, the /applications/:applicationid/exceptions
>> endpoint can be used.
>>
>> I’d appreciate your thoughts on this approach. Thanks!
>>
>> Best,
>> Yi
>>
>> At 2026-01-12 15:23:58, "Gyula Fóra" <[email protected]> wrote:
>>> Hi!
>>>
>>> Overall it makes sense. I cannot reproduce a job actually being
>>> submitted/archived properly with an invalid savepoint path (I am testing
>>> on 2.1 as no 2.2/2.3 docker image is available at the moment). In
>>> Kubernetes this seems to cause an immediate jobmanager shutdown.
>>>
>>> You are right, I am mostly concerned about all the error scenarios that
>>> currently do not result in FAILED job submissions.
>>>
>>> Regarding what error metadata to expose, I think what you write makes
>>> sense, with the only specific exception that checkpoint/state recovery
>>> information (what checkpoint are we restoring from at the moment) should
>>> always be included in the error/job/app metadata. This is crucial for the
>>> Kubernetes Operator (and possibly other external control planes) to
>>> handle the error. Currently, since this information is sometimes lost, it
>>> leads to many corner cases requiring manual intervention from users.
>>>
>>> Cheers
>>> Gyula
>>>
>>>> On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang <[email protected]> wrote:
>>>>
>>>> Hi Gyula,
>>>>
>>>> Thank you very much for your explanation.
>>>>
>>>> "Some errors such as invalid state path are not even submitted or when
>>>> it is Flink uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph
>>>> that doesn't actually include information about the checkpoint restore
>>>> path/configs etc."
>>>>
>>>> I ran some tests on the latest Flink 2.3 release by submitting a job
>>>> with an invalid `execution.state-recovery.path`. The job submission
>>>> itself succeeded, but the job failed during initialization. It seems
>>>> that at least for misconfigured state recovery paths, the job still
>>>> goes through submission and gets archived with sufficient diagnostic
>>>> info. Am I missing anything here? If there’s a specific Jira issue
>>>> describing such a scenario, it would be great to reference it for a
>>>> more concrete discussion around this requirement.
>>>>
>>>> That said, I agree that there are different error scenarios we might
>>>> encounter, which broadly fall into three categories:
>>>>
>>>> 1. Failures after successful job submission, which result in a FAILED
>>>> job state. In these cases, relevant diagnostics are already accessible
>>>> via existing job-related REST endpoints.
>>>> 2. Failures during job submission, leaving no concrete job entity to
>>>> query.
>>>> 3. Failures in the main() method unrelated to job submission/execution.
>>>>
>>>> The originally proposed /applications/:applicationid/exceptions
>>>> endpoint is intended to expose exceptions from all three categories.
>>>> From my understanding, your primary interest lies in scenario #2, where
>>>> additional context could help diagnose why submission failed, even
>>>> though no real job was created.
>>>>
>>>> Rather than introducing a general endpoint that exposes all possible
>>>> configuration and metadata, would it be more practical to conditionally
>>>> enrich exceptions? For example, when a submission fails due to invalid
>>>> state paths or misconfigured options, we could attach the relevant
>>>> configuration settings. This approach would complement the
>>>> /applications/:applicationid/exceptions design and allow us to
>>>> incrementally evolve toward richer diagnostics over time.
>>>> Having a concrete use case would greatly help align on the scope and
>>>> implementation details of such enrichment.
>>>>
>>>> Thanks again for your valuable feedback and suggestions!
>>>>
>>>> Best,
>>>> Yi
>>>>
>>>> P.S. I’ve updated the FLIP to reflect the change regarding using the
>>>> job name for job matching. Please let me know if you have any further
>>>> questions or suggestions.
>>>>
>>>> At 2026-01-08 17:07:16, "Gyula Fóra" <[email protected]> wrote:
>>>>> Hi Yi!
>>>>>
>>>>> Sorry for the late reply, I somehow missed your response:
>>>>>
>>>>> "Flink’s existing archive mechanism—combined with the
>>>>> HistoryServer—already provides persistent access to job-related
>>>>> information after failure. Specifically, the existing HistoryServer
>>>>> endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing
>>>>> the configuration including checkpoint restore paths, and remains
>>>>> accessible after failure."
>>>>>
>>>>> You are right, when a job fails this is true: we can see the past
>>>>> checkpoint history etc. But I think this doesn't apply to jobs that
>>>>> fail during submission or in the main method. Some errors such as
>>>>> invalid state path are not even submitted or when it is Flink uses
>>>>> ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't
>>>>> actually include information about the checkpoint restore path/configs
>>>>> etc.
>>>>>
>>>>> Cheers
>>>>> Gyula
>>>>>
>>>>> On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang <[email protected]> wrote:
>>>>>
>>>>>> Hi Gyula,
>>>>>>
>>>>>> Thank you so much for your thoughtful and insightful feedback!
>>>>>>
>>>>>> 1. I fully agree that using the job name for job matching is more
>>>>>> user-friendly and cleaner than relying on a jobIndex parameter. I’ll
>>>>>> update the FLIP accordingly to reflect this design change.
>>>>>>
>>>>>> 2. I’d like to dig a bit deeper to make sure I fully understand the
>>>>>> requirement. You have mentioned the need for a generic information
>>>>>> endpoint that remains accessible even after failure, and that it
>>>>>> should include additional info such as the checkpoint restore path
>>>>>> and configuration.
>>>>>>
>>>>>> From my current understanding, Flink’s existing archive
>>>>>> mechanism—combined with the HistoryServer—already provides
>>>>>> persistent access to job-related information after failure.
>>>>>> Specifically, the existing HistoryServer endpoint
>>>>>> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
>>>>>> configuration including checkpoint restore paths, and remains
>>>>>> accessible after failure.
>>>>>>
>>>>>> On the other hand, the proposed /applications/:appid/exceptions
>>>>>> endpoint is intended specifically to surface application-level
>>>>>> exceptions that occur outside the job lifecycle, which will also be
>>>>>> available through the HistoryServer after failure.
>>>>>>
>>>>>> So could you help clarify whether there is a specific failure
>>>>>> scenario or use case where the current archiving/HistoryServer
>>>>>> mechanism falls short, or where critical debugging information—like
>>>>>> the restore path or configuration—is not retrievable after a failure?
>>>>>>
>>>>>> Thanks again for your excellent suggestions!
>>>>>>
>>>>>> Best,
>>>>>> Yi
>>>>>>
>>>>>> At 2025-12-25 21:08:49, "Gyula Fóra" <[email protected]> wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> Overall I think the design/improvements look great. Some minor
>>>>>>> comments, improvement possibilities:
>>>>>>>
>>>>>>> 1. Could we simply use the job name for job matching? I think it's
>>>>>>> fair to require unique job names (or, if they are not unique, attach
>>>>>>> a sequence number to the name) instead of the jobIndex parameter.
>>>>>>> JobIndex sounds a bit weird and low level.
>>>>>>>
>>>>>>> 2. A big problem/limitation of the existing submission logic is that
>>>>>>> the submit-on-error logic is very limited (only handling certain
>>>>>>> types of errors and only showing exception info). We should capture
>>>>>>> different errors and metadata for failed applications, including
>>>>>>> checkpoint settings (for instance what checkpoint path was used
>>>>>>> during restore, which is a common cause of the errors). So instead
>>>>>>> of introducing a /applications/appid/exceptions endpoint, can we
>>>>>>> instead introduce a more generic information endpoint that would
>>>>>>> contain other information? This endpoint should be accessible even
>>>>>>> in case of failures, populated from the app result store, and should
>>>>>>> also contain some other info such as the checkpoint restore path,
>>>>>>> configuration, etc.
>>>>>>>
>>>>>>> Capturing more information on failed submissions would help resolve
>>>>>>> a lot of long-outstanding issues in the Flink Kubernetes Operator
>>>>>>> as well.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Gyula
>>>>>>>
>>>>>>> On Thu, Dec 25, 2025 at 1:54 PM Lei Yang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thank you Yi for your reply, looks good to me!
>>>>>>>> +1 for this proposal
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Lei
>>>>>>>>
>>>>>>>> Yi Zhang <[email protected]> wrote on Thu, Dec 25, 2025 at 10:02:
>>>>>>>>
>>>>>>>>> Hi Lei,
>>>>>>>>>
>>>>>>>>> Thank you for the feedback!
>>>>>>>>> The "Archiving Directory Structure" section describes a change in
>>>>>>>>> how archived files are organized under jobmanager.archive.fs.dir.
>>>>>>>>> While this change was originally proposed in FLIP-549, it's indeed
>>>>>>>>> a significant application-level update, so I'm glad to have the
>>>>>>>>> chance to clarify it here.
>>>>>>>>>
>>>>>>>>> To answer your question directly: backward compatibility is fully
>>>>>>>>> preserved.
>>>>>>>>>
>>>>>>>>> In earlier Flink versions, job archives were written directly
>>>>>>>>> under the configured jobmanager.archive.fs.dir. With this update,
>>>>>>>>> Flink will instead use a hierarchical cluster-application-job
>>>>>>>>> structure.
>>>>>>>>> We understand that many users already have archives stored in the
>>>>>>>>> legacy flat layout. To ensure a smooth transition, the History
>>>>>>>>> Server will be updated to read archives from both the old and new
>>>>>>>>> directory structures. As a result, all previously archived jobs
>>>>>>>>> will remain accessible and visible.
>>>>>>>>>
>>>>>>>>> If you have additional questions or specific edge cases in mind,
>>>>>>>>> I’d be happy to discuss them further!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Yi
>>>>>>>>>
>>>>>>>>> At 2025-12-24 11:35:00, "Lei Yang" <[email protected]> wrote:
>>>>>>>>>> Hi Yi,
>>>>>>>>>>
>>>>>>>>>> Thank you for creating this FLIP! The introduction of the
>>>>>>>>>> Application entity significantly enhances the observability and
>>>>>>>>>> manageability of user logic, especially benefiting batch
>>>>>>>>>> workloads. This is truly excellent work!
>>>>>>>>>>
>>>>>>>>>> However, I have a compatibility concern and would appreciate
>>>>>>>>>> your clarification. In the “Archiving Directory Structure”
>>>>>>>>>> section, I noticed that the directory structure has been changed.
>>>>>>>>>> If users have configured a persistent external path for
>>>>>>>>>> jobmanager.archive.fs.dir, will their existing archives become
>>>>>>>>>> unreadable after this change? Will the implementation of this
>>>>>>>>>> FLIP maintain backward compatibility with previously archived
>>>>>>>>>> job data?
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Lei
>>>>>>>>>>
>>>>>>>>>> Yi Zhang <[email protected]> wrote on Wed, Dec 17, 2025 at 14:18:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> I would like to start a discussion about FLIP-560: Application
>>>>>>>>>>> Capability Enhancement [1].
>>>>>>>>>>>
>>>>>>>>>>> The primary goal of this FLIP is to improve the usability and
>>>>>>>>>>> availability of Flink applications by introducing the following
>>>>>>>>>>> enhancements:
>>>>>>>>>>>
>>>>>>>>>>> 1. Support multi-job execution in Application Mode, which is an
>>>>>>>>>>> important batch-processing use case.
>>>>>>>>>>> 2. Support re-running the user's main method after JobManager
>>>>>>>>>>> restarts due to failures in Session Mode.
>>>>>>>>>>> 3. Expose exceptions thrown in the user's main method via
>>>>>>>>>>> REST/UI.
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to your feedback and suggestions!
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>>
>>>>>>>>>>> Yi Zhang
