Hi! I think a separate endpoint with the effective config makes perfect sense and will cover the requirements.
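
Just to make that concrete for myself (purely an illustrative sketch, the values below are made up and the exact key set is of course up to the FLIP), I would expect the response to follow the same key/value list shape as the existing /jobs/:jobid/jobmanager/config endpoint, e.g.:

    GET /applications/:applicationid/jobmanager/config

    [
      { "key": "execution.state-recovery.path", "value": "s3://my-bucket/savepoints/savepoint-abc" },
      { "key": "jobmanager.archive.fs.dir", "value": "s3://my-bucket/completed-jobs" }
    ]

That way the Kubernetes Operator could read the effective recovery path even when no job was ever submitted.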
Thank you for including :)
Gyula

Sent from my iPhone

> On 15 Jan 2026, at 06:50, Yi Zhang <[email protected]> wrote:
>
> Hi Gyula,
>
> I have given this some more thought, and I agree with your point: users (and external controllers like the Kubernetes Operator) need access to critical context such as the state recovery path, even in cases where no job has been successfully submitted.
>
> What if we introduce a new REST API endpoint, such as /applications/:applicationid/jobmanager/config, to expose the effective JobManager configuration used (or intended to be used) by the application? This could include key settings like the state recovery path and other relevant configured options.
>
> It might help make the API responsibilities clearer, and also provide valuable visibility even when no errors occur. For actual error details, the /applications/:applicationid/exceptions endpoint can be used.
>
> I’d appreciate your thoughts on this approach. Thanks!
>
> Best,
> Yi
>
> At 2026-01-12 15:23:58, "Gyula Fóra" <[email protected]> wrote:
>> Hi!
>>
>> Overall it makes sense. I cannot reproduce a job actually being submitted/archived properly with an invalid savepoint path (I am testing on 2.1 as no 2.2/2.3 docker image is available at the moment). In Kubernetes this seems to cause a jobmanager shutdown immediately.
>>
>> You are right, I am mostly concerned about all the error scenarios that currently do not result in FAILED job submissions.
>>
>> Regarding what error metadata to expose, I think what you write makes sense, with the only specific exception that checkpoint/state recovery information (what checkpoint we are restoring from at the moment) should always be included in the error/job/app metadata. This is crucial for the Kubernetes Operator (and possibly other external control planes) to handle the error. Currently, since this information is sometimes lost, it leads to many corner cases requiring manual intervention from users.
>>
>> Cheers
>> Gyula
>>
>>> On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang <[email protected]> wrote:
>>>
>>> Hi Gyula,
>>>
>>> Thank you very much for your explanation.
>>>
>>> "Some errors such as invalid state path are not even submitted, or when it is, Flink uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't actually include information about the checkpoint restore path/configs etc."
>>>
>>> I ran some tests on the latest Flink 2.3 release by submitting a job with an invalid `execution.state-recovery.path`. The job submission itself succeeded, but the job failed during initialization. It seems that at least for misconfigured state recovery paths, the job still goes through submission and gets archived with sufficient diagnostic info. Am I missing anything here? If there’s a specific Jira issue describing such a scenario, it would be great to reference it for a more concrete discussion around this requirement.
>>>
>>> That said, I agree that there are different error scenarios we might encounter, which broadly fall into three categories:
>>>
>>> 1. Failures after successful job submission, which result in a FAILED job state. In these cases, relevant diagnostics are already accessible via existing job-related REST endpoints.
>>> 2. Failures during job submission, leaving no concrete job entity to query.
>>> 3. Failures in the main() method unrelated to job submission/execution.
>>>
>>> The originally proposed /applications/:applicationid/exceptions endpoint is intended to expose exceptions from all three categories. From my understanding, your primary interest lies in scenario #2, where additional context could help diagnose why submission failed, even though no real job was created.
>>>
>>> Rather than introducing a general endpoint that exposes all possible configuration and metadata, would it be more practical to conditionally enrich exceptions? For example, when a submission fails due to invalid state paths or misconfigured options, we could attach the relevant configuration settings. This approach would complement the /applications/:applicationid/exceptions design and allow us to incrementally evolve toward richer diagnostics over time. Having a concrete use case would greatly help align on the scope and implementation details of such enrichment.
>>>
>>> Thanks again for your valuable feedback and suggestions!
>>>
>>> Best,
>>> Yi
>>>
>>> P.S. I’ve updated the FLIP to reflect the change regarding using the job name for job matching. Please let me know if you have any further questions or suggestions.
>>>
>>> At 2026-01-08 17:07:16, "Gyula Fóra" <[email protected]> wrote:
>>>> Hi Yi!
>>>>
>>>> Sorry for the late reply, I somehow missed your response:
>>>>
>>>> "Flink’s existing archive mechanism—combined with the HistoryServer—already provides persistent access to job-related information after failure. Specifically, the existing HistoryServer endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration including checkpoint restore paths, and remains accessible after failure."
>>>>
>>>> You are right, when a job fails this is true: we can see the past checkpoint history etc. But I think this doesn't apply to jobs that fail during submission or in the main method. Some errors such as invalid state path are not even submitted, or when it is, Flink uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't actually include information about the checkpoint restore path/configs etc.
>>>>
>>>> Cheers
>>>> Gyula
>>>>
>>>> On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang <[email protected]> wrote:
>>>>
>>>>> Hi Gyula,
>>>>>
>>>>> Thank you so much for your thoughtful and insightful feedback!
>>>>>
>>>>> 1. I fully agree that using the job name for job matching is more user-friendly and cleaner than relying on a jobIndex parameter. I’ll update the FLIP accordingly to reflect this design change.
>>>>>
>>>>> 2. I’d like to dig a bit deeper to make sure I fully understand the requirement. You have mentioned the need for a generic information endpoint that remains accessible even after failure, and that it should include additional info such as the checkpoint restore path and configuration.
>>>>>
>>>>> From my current understanding, Flink’s existing archive mechanism—combined with the HistoryServer—already provides persistent access to job-related information after failure. Specifically, the existing HistoryServer endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration including checkpoint restore paths, and remains accessible after failure.
>>>>>
>>>>> On the other hand, the proposed /applications/:appid/exceptions endpoint is intended specifically to surface application-level exceptions that occur outside the job lifecycle, which will also be available through the HistoryServer after failure.
>>>>>
>>>>> So could you help clarify whether there is a specific failure scenario or use case where the current archiving/HistoryServer mechanism falls short, or where critical debugging information—like the restore path or configuration—is not retrievable after a failure?
>>>>>
>>>>> Thanks again for your excellent suggestions!
>>>>>
>>>>> Best,
>>>>> Yi
>>>>>
>>>>> At 2025-12-25 21:08:49, "Gyula Fóra" <[email protected]> wrote:
>>>>>> Hi!
>>>>>>
>>>>>> Overall I think the design/improvements look great. Some minor comments, improvement possibilities:
>>>>>>
>>>>>> 1. Could we simply use the job name for job matching? I think it's fair to require unique job names (or, if they are not unique, attach a sequence number to the name) instead of the jobIndex parameter. JobIndex sounds a bit weird and low level.
>>>>>>
>>>>>> 2. A big problem/limitation of the existing submission logic is that the submit-on-error logic is very limited (only handling certain types of errors and only showing exception info). We should capture different errors and metadata for failed applications, including checkpoint settings (for instance what checkpoint path was used during restore, which is a common cause of the errors). So instead of introducing a /applications/appid/exceptions endpoint, can we instead introduce a more generic information endpoint that would contain other information? This endpoint should be accessible even in case of failures, populated from the app result store, and should also contain some other info such as the checkpoint restore path, configuration etc.
>>>>>>
>>>>>> Capturing more information on failed submissions would help resolve a lot of long-outstanding issues in the Flink Kubernetes Operator as well.
>>>>>>
>>>>>> Cheers
>>>>>> Gyula
>>>>>>
>>>>>> On Thu, Dec 25, 2025 at 1:54 PM Lei Yang <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you Yi for your reply, looks good to me!
>>>>>>> +1 for this proposal
>>>>>>>
>>>>>>> Best,
>>>>>>> Lei
>>>>>>>
>>>>>>> On Thu, Dec 25, 2025 at 10:02, Yi Zhang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Lei,
>>>>>>>>
>>>>>>>> Thank you for the feedback! The "Archiving Directory Structure" section describes a change in how archived files are organized under jobmanager.archive.fs.dir. While this change was originally proposed in FLIP-549, it's indeed a significant application-level update, so I'm glad to have the chance to clarify it here.
>>>>>>>>
>>>>>>>> To answer your question directly: backward compatibility is fully preserved.
>>>>>>>>
>>>>>>>> In earlier Flink versions, job archives were written directly under the configured jobmanager.archive.fs.dir. With this update, Flink will instead use a hierarchical cluster-application-job structure. We understand that many users already have archives stored in the legacy flat layout.
>>>>>>>> To ensure a smooth transition, the History Server will be updated to read archives from both the old and new directory structures. As a result, all previously archived jobs will remain accessible and visible.
>>>>>>>>
>>>>>>>> If you have additional questions or specific edge cases in mind, I’d be happy to discuss them further!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Yi
>>>>>>>>
>>>>>>>> At 2025-12-24 11:35:00, "Lei Yang" <[email protected]> wrote:
>>>>>>>>> Hi Yi,
>>>>>>>>>
>>>>>>>>> Thank you for creating this FLIP! The introduction of the Application entity significantly enhances the observability and manageability of user logic, especially benefiting batch workloads. This is truly excellent work!
>>>>>>>>>
>>>>>>>>> However, I have a compatibility concern and would appreciate your clarification. In the “Archiving Directory Structure” section, I noticed that the directory structure has been changed. If users have configured a persistent external path for jobmanager.archive.fs.dir, will their existing archives become unreadable after this change? Will the implementation of this FLIP maintain backward compatibility with previously archived job data?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Lei
>>>>>>>>>
>>>>>>>>> On Wed, Dec 17, 2025 at 14:18, Yi Zhang <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I would like to start a discussion about FLIP-560: Application Capability Enhancement [1].
>>>>>>>>>>
>>>>>>>>>> The primary goal of this FLIP is to improve the usability and availability of Flink applications by introducing the following enhancements:
>>>>>>>>>>
>>>>>>>>>> 1. Support multi-job execution in Application Mode, which is an important batch-processing use case.
>>>>>>>>>> 2. Support re-running the user's main method after JobManager restarts due to failures in Session Mode.
>>>>>>>>>> 3. Expose exceptions thrown in the user's main method via REST/UI.
>>>>>>>>>>
>>>>>>>>>> Looking forward to your feedback and suggestions!
>>>>>>>>>>
>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>>
>>>>>>>>>> Yi Zhang
