Hi! Overall it makes sense. I cannot reproduce a job actually being submitted/archived properly with an invalid savepoint path (I am testing on 2.1, as no 2.2/2.3 Docker image is available at the moment). In Kubernetes this seems to cause an immediate JobManager shutdown.
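
For reference, this is the kind of configuration I am testing with; the bucket and path below are illustrative only and point at a location that does not exist:

    # Illustrative Flink configuration for the repro attempt:
    # the recovery path is deliberately invalid / non-existent
    execution.state-recovery.path: s3://my-bucket/savepoints/savepoint-does-not-exist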
You are right, I am mostly concerned about all the error scenarios that currently do not result in FAILED job submissions. Regarding what error metadata to expose, I think what you write makes sense, with the only specific exception that checkpoint/state recovery information (i.e., which checkpoint we are restoring from at the moment) should always be included in the error/job/app metadata. This is crucial for the Kubernetes Operator (and possibly other external control planes) to handle the error. Currently, since this information is sometimes lost, it leads to many corner cases requiring manual intervention from users.

Cheers
Gyula

On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang <[email protected]> wrote:

> Hi Gyula,
>
> Thank you very much for your explanation.
>
> "Some errors such as invalid state path are not even submitted, or, when they are, Flink uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph, which doesn't actually include information about the checkpoint restore path/configs etc."
>
> I ran some tests on the latest Flink 2.3 release by submitting a job with an invalid `execution.state-recovery.path`. The job submission itself succeeded, but the job failed during initialization. It seems that, at least for misconfigured state recovery paths, the job still goes through submission and gets archived with sufficient diagnostic info. Am I missing anything here? If there’s a specific Jira issue describing such a scenario, it would be great to reference it for a more concrete discussion around these requirements.
>
> That said, I agree that there are different error scenarios we might encounter, which broadly fall into three categories:
>
> 1. Failures after successful job submission, which result in a FAILED job state. In these cases, relevant diagnostics are already accessible via the existing job-related REST endpoints.
> 2. Failures during job submission, leaving no concrete job entity to query.
> 3. Failures in the main() method unrelated to job submission/execution.
>
> The originally proposed /applications/:applicationid/exceptions endpoint is intended to expose exceptions from all three categories. From my understanding, your primary interest lies in scenario #2, where additional context could help diagnose why submission failed, even though no real job was created.
>
> Rather than introducing a general endpoint that exposes all possible configuration and metadata, would it be more practical to conditionally enrich exceptions? For example, when a submission fails due to invalid state paths or misconfigured options, we could attach the relevant configuration settings. This approach would complement the /applications/:applicationid/exceptions design and allow us to incrementally evolve toward richer diagnostics over time. Having a concrete use case would greatly help align on the scope and implementation details of such enrichment.
>
> Thanks again for your valuable feedback and suggestions!
>
> Best,
> Yi
>
> P.S. I’ve updated the FLIP to reflect the change regarding using the job name for job matching. Please let me know if you have any further questions or suggestions.
>
> At 2026-01-08 17:07:16, "Gyula Fóra" <[email protected]> wrote:
> >Hi Yi!
> >
> >Sorry for the late reply, I somehow missed your response:
> >
> >"Flink’s existing archive mechanism—combined with the HistoryServer—already provides persistent access to job-related information after failure. Specifically, the existing HistoryServer endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration including checkpoint restore paths, and remains accessible after failure."
> >
> >You are right, when a job fails this is true: we can see the past checkpoint history etc. But I think this doesn't apply to jobs that fail during submission or in the main method. Some errors such as invalid state path are not even submitted, or, when they are, Flink uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph, which doesn't actually include information about the checkpoint restore path/configs etc.
> >
> >Cheers
> >Gyula
> >
> >On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang <[email protected]> wrote:
> >
> >> Hi Gyula,
> >>
> >> Thank you so much for your thoughtful and insightful feedback!
> >>
> >> 1. I fully agree that using the job name for job matching is more user-friendly and cleaner than relying on a jobIndex parameter. I’ll update the FLIP accordingly to reflect this design change.
> >>
> >> 2. I’d like to dig a bit deeper to make sure I fully understand the requirement. You have mentioned the need for a generic information endpoint that remains accessible even after failure, and that it should include additional info such as the checkpoint restore path and configuration.
> >>
> >> From my current understanding, Flink’s existing archive mechanism—combined with the HistoryServer—already provides persistent access to job-related information after failure. Specifically, the existing HistoryServer endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration including checkpoint restore paths, and remains accessible after failure. On the other hand, the proposed /applications/:appid/exceptions endpoint is intended specifically to surface application-level exceptions that occur outside the job lifecycle, which will also be available through the HistoryServer after failure.
> >>
> >> So could you help clarify whether there is a specific failure scenario or use case where the current archiving/HistoryServer mechanism falls short, or where critical debugging information—like the restore path or configuration—is not retrievable after a failure?
> >>
> >> Thanks again for your excellent suggestions!
> >>
> >> Best,
> >> Yi
> >>
> >> At 2025-12-25 21:08:49, "Gyula Fóra" <[email protected]> wrote:
> >> >Hi!
> >> >
> >> >Overall I think the design/improvements look great. Some minor comments, improvement possibilities:
> >> >
> >> >1. Could we simply use the job name for job matching? I think it's fair to require unique job names (or, if they are not unique, attach a sequence number to the name) instead of the jobIndex parameter. JobIndex sounds a bit weird and low level.
> >> >
> >> >2. A big problem/limitation of the existing submission logic is that the submit-on-error logic is very limited (only handling certain types of errors and only showing exception info). We should capture different errors and metadata for failed applications, including checkpoint settings (for instance what checkpoint path was used during restore, which is a common cause of the errors). So instead of introducing a /applications/appid/exceptions endpoint, can we instead introduce a more generic information endpoint that would contain other information? This endpoint should be accessible even in case of failures, populated from the app result store, and should also contain some other info such as the checkpoint restore path, configuration etc.
> >> >
> >> >Capturing more information on failed submissions would help resolve a lot of long outstanding issues in the Flink Kubernetes Operator as well.
> >> >
> >> >Cheers
> >> >Gyula
> >> >
> >> >On Thu, Dec 25, 2025 at 1:54 PM Lei Yang <[email protected]> wrote:
> >> >
> >> >> Thank you Yi for your reply, looks good to me!
> >> >> +1 for this proposal
> >> >>
> >> >> Best,
> >> >> Lei
> >> >>
> >> >> Yi Zhang <[email protected]> wrote on Thu, Dec 25, 2025 at 10:02:
> >> >>
> >> >> > Hi Lei,
> >> >> >
> >> >> > Thank you for the feedback!
> >> >> > The "Archiving Directory Structure" section describes a change in how archived files are organized under jobmanager.archive.fs.dir. While this change was originally proposed in FLIP-549, it's indeed a significant application-level update, so I'm glad to have the chance to clarify it here.
> >> >> >
> >> >> > To answer your question directly: backward compatibility is fully preserved.
> >> >> >
> >> >> > In earlier Flink versions, job archives were written directly under the configured jobmanager.archive.fs.dir. With this update, Flink will instead use a hierarchical cluster-application-job structure. We understand that many users already have archives stored in the legacy flat layout. To ensure a smooth transition, the History Server will be updated to read archives from both the old and new directory structures. As a result, all previously archived jobs will remain accessible and visible.
> >> >> >
> >> >> > If you have additional questions or specific edge cases in mind, I’d be happy to discuss them further!
> >> >> >
> >> >> > Best,
> >> >> > Yi
> >> >> >
> >> >> > At 2025-12-24 11:35:00, "Lei Yang" <[email protected]> wrote:
> >> >> > >Hi Yi,
> >> >> > >
> >> >> > >Thank you for creating this FLIP! The introduction of the Application entity significantly enhances the observability and manageability of user logic, especially benefiting batch workloads. This is truly excellent work!
> >> >> > >
> >> >> > >However, I have a compatibility concern and would appreciate your clarification. In the "Archiving Directory Structure" section, I noticed that the directory structure has been changed. If users have configured a persistent external path for jobmanager.archive.fs.dir, will their existing archives become unreadable after this change? Will the implementation of this FLIP maintain backward compatibility with previously archived job data?
> >> >> > >
> >> >> > >Best regards,
> >> >> > >Lei
> >> >> > >
> >> >> > >Yi Zhang <[email protected]> wrote on Wed, Dec 17, 2025 at 14:18:
> >> >> > >
> >> >> > >> Hi everyone,
> >> >> > >>
> >> >> > >> I would like to start a discussion about FLIP-560: Application Capability Enhancement [1].
> >> >> > >>
> >> >> > >> The primary goal of this FLIP is to improve the usability and availability of Flink applications by introducing the following enhancements:
> >> >> > >>
> >> >> > >> 1. Support multi-job execution in Application Mode, which is an important batch-processing use case.
> >> >> > >> 2. Support re-running the user's main method after JobManager restarts due to failures in Session Mode.
> >> >> > >> 3. Expose exceptions thrown in the user's main method via REST/UI.
> >> >> > >>
> >> >> > >> Looking forward to your feedback and suggestions!
> >> >> > >>
> >> >> > >> [1]
> >> >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
> >> >> > >>
> >> >> > >> Best Regards,
> >> >> > >>
> >> >> > >> Yi Zhang
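
P.S. To make the metadata request above more concrete, here is a rough sketch of the kind of payload I would hope an application-level info/exceptions endpoint could return even for failed submissions. All field names and values are illustrative only, not a proposal for the exact schema:

    {
      "applicationId": "app-...",
      "status": "FAILED",
      "exception": {
        "className": "java.io.FileNotFoundException",
        "message": "Cannot find checkpoint/savepoint data at s3://my-bucket/savepoints/savepoint-does-not-exist",
        "stackTrace": "..."
      },
      "stateRecovery": {
        "restorePath": "s3://my-bucket/savepoints/savepoint-does-not-exist",
        "claimMode": "NO_CLAIM"
      },
      "configuration": {
        "execution.state-recovery.path": "s3://my-bucket/savepoints/savepoint-does-not-exist"
      }
    }

The key point for the Kubernetes Operator is that the state recovery block is always present, even when no real job was ever created.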