Hi!

I think a separate endpoint with the effective config makes perfect sense and 
will cover the requirements.

Thank you for including it :)

Gyula

Sent from my iPhone

> On 15 Jan 2026, at 06:50, Yi Zhang <[email protected]> wrote:
> 
> Hi Gyula,
> 
> 
> I have given this some more thought, and I agree with your
> point: users (and external controllers like the Kubernetes
> Operator) need access to critical context such as the state
> recovery path, even in cases where no job has been
> successfully submitted.
> 
> 
> What if we introduce a new REST API endpoint, such
> as /applications/:applicationid/jobmanager/config,
> to expose the effective JobManager configuration used (or
> intended to be used) by the application? This could include
> key settings like the state recovery path and other relevant
> configured options.
> 
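> Just to make the idea concrete, here is a rough sketch of how an external
> controller (for example the Kubernetes Operator) might query such an endpoint
> once it exists. The host, port, application id and response shape are of
> course only assumptions on my side at this point:
> 
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
> 
> public class FetchEffectiveJobManagerConfig {
>     public static void main(String[] args) throws Exception {
>         // Hypothetical URL built from the proposed endpoint; host, port and
>         // application id are placeholders.
>         URI uri = URI.create(
>             "http://jobmanager:8081/applications/my-app-id/jobmanager/config");
> 
>         HttpClient client = HttpClient.newHttpClient();
>         HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
>         HttpResponse<String> response =
>             client.send(request, HttpResponse.BodyHandlers.ofString());
> 
>         // Expected to contain the effective configuration as key/value pairs,
>         // e.g. execution.state-recovery.path and other recovery-related options.
>         System.out.println(response.body());
>     }
> }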
> 
> It might help make the API responsibilities clearer, and also
> provide valuable visibility even when no errors occur. For
> actual error details, the
> /applications/:applicationid/exceptions endpoint can be
> used.
> 
> 
> I’d appreciate your thoughts on this approach. Thanks!
> 
> 
> Best,
> Yi
> 
> At 2026-01-12 15:23:58, "Gyula Fóra" <[email protected]> wrote:
>> Hi!
>> 
>> Overall it makes sense. I cannot reproduce a job with an invalid savepoint
>> path actually being submitted/archived properly (I am testing on 2.1, as no
>> 2.2/2.3 Docker image is available at the moment). In Kubernetes this seems
>> to cause an immediate JobManager shutdown.
>> 
>> You are right, I am mostly concerned about all the error scenarios that
>> currently do not result in FAILED job submissions.
>> 
>> Regarding what error metadata to expose, I think what you write makes
>> sense, with the only specific exception that checkpoint/state recovery
>> information (what checkpoint we are restoring from at the moment) should
>> always be included in the error/job/app metadata. This is crucial for the
>> Kubernetes Operator (and possibly other external control planes) to handle
>> the error. Currently, since this information is sometimes lost, it leads to
>> many corner cases requiring manual intervention from users.
>> 
>> Cheers
>> Gyula
>> 
>>> On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang <[email protected]> wrote:
>>> 
>>> 
>>> 
>>> Hi Gyula,
>>> 
>>> 
>>> Thank you very much for your explanation.
>>> 
>>> 
>>> "Some errors such as invalid state path are not even submitted or when it
>>> is Flink
>>> uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't
>>> actually include information about the checkpoint restore path/configs
>>> etc."
>>> 
>>> 
>>> 
>>> I ran some tests on the latest Flink 2.3 release by submitting a job with
>>> an invalid
>>> `execution.state-recovery.path`. The job submission itself succeeded, but
>>> the job
>>> failed during initialization. It seems that at least for misconfigured
>>> state recovery
>>> paths, the job still goes through submission and gets archived with
>>> sufficient
>>> diagnostic info. Am I missing anything here? If there’s a specific Jira
>>> issue
>>> describing such a scenario, it would be great to reference it for more
>>> concrete discussion around this requirement.
>>> 
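>>> In case it helps to compare setups, a minimal version of such a test looks
>>> roughly like this (the recovery path is deliberately invalid; exact
>>> DataStream API method names may differ between Flink versions):
>>> 
>>> import org.apache.flink.configuration.Configuration;
>>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>> 
>>> public class InvalidRecoveryPathTest {
>>>     public static void main(String[] args) throws Exception {
>>>         Configuration conf = new Configuration();
>>>         // Point state recovery at a location that does not exist.
>>>         conf.setString("execution.state-recovery.path",
>>>             "file:///tmp/does-not-exist/savepoint");
>>> 
>>>         StreamExecutionEnvironment env =
>>>             StreamExecutionEnvironment.getExecutionEnvironment(conf);
>>>         // Any trivial pipeline is enough to trigger the restore attempt.
>>>         env.fromData(1, 2, 3).print();
>>>         env.execute("invalid-recovery-path-test");
>>>     }
>>> }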
>>> 
>>> That said, I agree that there are different error scenarios we might
>>> encounter, which
>>> broadly fall into three categories:
>>> 
>>> 1. Failures after successful job submission, which result in a FAILED job
>>> state. In
>>> these cases, relevant diagnostics are already accessible via existing
>>> job-related
>>> REST endpoints.
>>> 2. Failures during job submission, leaving no concrete job entity to query.
>>> 3. Failures in the main() method unrelated to job submission/execution.
>>> 
>>> The originally proposed /applications/:applicationid/exceptions endpoint
>>> is intended
>>> to expose exceptions from all three categories. From my understanding, your
>>> primary interest lies in scenario #2, where additional context could help
>>> diagnose
>>> why submission failed, even though no real job was created.
>>> 
>>> 
>>> Rather than introducing a general endpoint that exposes all possible
>>> configuration
>>> and metadata, would it be more practical to conditionally enrich
>>> exceptions? For
>>> example, when a submission fails due to invalid state paths or
>>> misconfigured
>>> options, we could attach the relevant configuration settings. This
>>> approach would
>>> complement the /applications/:applicationid/exceptions design and allow us
>>> to
>>> incrementally evolve toward richer diagnostics over time.
>>> Having a concrete use case would greatly help align on the scope and
>>> implementation details of such enrichment.
>>> 
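>>> To sketch what I have in mind (the type and field names below are purely
>>> hypothetical and not part of the FLIP), an enriched exception entry could
>>> look roughly like this:
>>> 
>>> import java.util.Map;
>>> 
>>> // One entry as it might be returned by /applications/:applicationid/exceptions.
>>> public record ApplicationExceptionEntry(
>>>         String exceptionName,
>>>         String stacktrace,
>>>         long timestamp,
>>>         // Populated only for submission/recovery-related failures, e.g.
>>>         // execution.state-recovery.path and related options.
>>>         Map<String, String> relatedConfiguration) {
>>> }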
>>> 
>>> Thanks again for your valuable feedback and suggestions!
>>> 
>>> 
>>> Best,
>>> Yi
>>> 
>>> P.S. I’ve updated the FLIP to reflect the change regarding using the job
>>> name for job
>>> matching. Please let me know if you have any further questions or
>>> suggestions.
>>> 
>>> 
>>> At 2026-01-08 17:07:16, "Gyula Fóra" <[email protected]> wrote:
>>>> Hi Yi!
>>>> 
>>>> Sorry for the late reply, I somehow missed your response:
>>>> 
>>>> "Flink’s existing archive mechanism—combined
>>>> with the HistoryServer—already provides persistent access to job-related
>>>> information after failure. Specifically, the existing HistoryServer
>>> endpoint
>>>> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
>>> configuration
>>>> 
>>>> including checkpoint restore paths, and remains accessible after failure."
>>>> 
>>>> You are right, when a job fails this is true: we can see the past checkpoint
>>>> history etc. But I think this doesn't apply to jobs that fail during
>>>> submission or in the main method. With some errors, such as an invalid state
>>>> path, the job is not even submitted, or when it is, Flink uses
>>>> ArchivedExecutionGraph.createSparseArchivedExecutionGraph, which doesn't
>>>> actually include information about the checkpoint restore path/configs etc.
>>>> 
>>>> Cheers
>>>> Gyula
>>>> 
>>>> On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang <[email protected]> wrote:
>>>> 
>>>>> Hi Gyula,
>>>>> 
>>>>> 
>>>>> Thank you so much for your thoughtful and insightful feedback!
>>>>> 
>>>>> 
>>>>> 1.  I fully agree that using the job name for job matching is more
>>>>> user-friendly and
>>>>> cleaner than relying on a jobIndex parameter. I’ll update the FLIP
>>>>> accordingly
>>>>> to reflect this design change.
>>>>> 
>>>>> 
>>>>> 2. I’d like to dig a bit deeper to make sure I fully understand the
>>>>> requirement.
>>>>> You have mentioned the need for a generic information endpoint that
>>>>> remains
>>>>> accessible even after failure, and that it should include additional
>>> info
>>>>> such as
>>>>> the checkpoint restore path and configuration.
>>>>> 
>>>>> From my current understanding, Flink’s existing archive
>>> mechanism—combined
>>>>> with the HistoryServer—already provides persistent access to job-related
>>>>> information after failure. Specifically, the existing HistoryServer
>>>>> endpoint
>>>>> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
>>>>> configuration
>>>>> including checkpoint restore paths, and remains accessible after
>>> failure.
>>>>> On the other hand, the proposed /applications/:appid/exceptions endpoint
>>>>> is
>>>>> intended specifically to surface application-level exceptions that occur
>>>>> outside
>>>>> the job lifecycle, which will also be available through the
>>> HistoryServer
>>>>> after
>>>>> failure.
>>>>> 
>>>>> So could you help clarify whether there is a specific failure scenario
>>> or
>>>>> use case
>>>>> where the current archiving/HistoryServer mechanism falls short or where
>>>>> critical
>>>>> debugging information—like the restore path or configuration—is not
>>>>> retrievable
>>>>> after a failure?
>>>>> 
>>>>> 
>>>>> Thanks again for your excellent suggestions!
>>>>> 
>>>>> Best,
>>>>> Yi
>>>>> 
>>>>> At 2025-12-25 21:08:49, "Gyula Fóra" <[email protected]> wrote:
>>>>>> Hi!
>>>>>> 
>>>>>> Overall I think the design/improvements look great. Some minor
>>> comments,
>>>>>> improvement possibilities:
>>>>>> 
>>>>>> 1. Could we simply use the job name for job matching? I think it's fair
>>>>>> to require unique job names (or, if they are not unique, attach a sequence
>>>>>> number to the name) instead of the jobIndex parameter. JobIndex sounds a
>>>>>> bit weird and low-level.
>>>>>> 
>>>>>> 2. A big problem/limitation of the existing submission logic is that the
>>>>>> submit-on-error logic is very limited (only handling certain types of
>>>>>> errors and only showing exception info). We should capture different
>>>>>> errors and metadata for failed applications, including checkpoint settings
>>>>>> (for instance, what checkpoint path was used during restore, which is a
>>>>>> common cause of the errors). So instead of introducing a
>>>>>> /applications/appid/exceptions endpoint, can we instead introduce a more
>>>>>> generic information endpoint that would contain other information? This
>>>>>> endpoint should be accessible even in case of failures, populated from
>>>>>> the app result store, and should also contain other info such as the
>>>>>> checkpoint restore path, configuration, etc.
>>>>>> 
>>>>>> Capturing more information on failed submissions would help resolve a
>>> lot
>>>>>> of long outstanding issues in the Flink Kubernetes Operator as well.
>>>>>> 
>>>>>> Cheers
>>>>>> Gyula
>>>>>> 
>>>>>> 
>>>>>> On Thu, Dec 25, 2025 at 1:54 PM Lei Yang <[email protected]> wrote:
>>>>>> 
>>>>>>> Thank you Yi for your reply, looks good to me!
>>>>>>> +1 for this proposal
>>>>>>> Best,
>>>>>>> Lei
>>>>>>> 
>>>>>>> Yi Zhang <[email protected]> 于2025年12月25日周四 10:02写道:
>>>>>>> 
>>>>>>>> Hi Lei,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thank you for the feedback!
>>>>>>>> The "Archiving Directory Structure" section describes a change in
>>> how
>>>>>>>> archived
>>>>>>>> files are organized under jobmanager.archive.fs.dir. While this
>>> change
>>>>>>> was
>>>>>>>> originally proposed in FLIP-549, it's indeed a significant
>>>>>>>> application-level update,
>>>>>>>> so I'm glad to have the chance to clarify it here.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> To answer your question directly: backward compatibility is fully
>>>>>>>> preserved.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> In earlier Flink versions, job archives were written directly under
>>>>> the
>>>>>>>> configured
>>>>>>>> jobmanager.archive.fs.dir. With this update, Flink will instead
>>> use a
>>>>>>>> hierarchical
>>>>>>>> cluster-application-job structure.
>>>>>>>> We understand that many users already have archives stored in the
>>>>> legacy
>>>>>>>> flat
>>>>>>>> layout. To ensure a smooth transition, the History Server will be
>>>>> updated
>>>>>>>> to read
>>>>>>>> archives from both the old and new directory structures. As a
>>> result,
>>>>> all
>>>>>>>> previously archived jobs will remain accessible and visible.
>>>>>>>> 
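>>>>>>>> To illustrate (the exact path segments below are my own shorthand, not
>>>>>>>> the final naming from the FLIP): the legacy flat layout stores one
>>>>>>>> archive file per job directly under the configured directory,
>>>>>>>> 
>>>>>>>>     <jobmanager.archive.fs.dir>/<jobId>
>>>>>>>> 
>>>>>>>> whereas the new hierarchical layout groups archives per cluster and
>>>>>>>> application, roughly
>>>>>>>> 
>>>>>>>>     <jobmanager.archive.fs.dir>/<clusterId>/<applicationId>/<jobId>
>>>>>>>> 
>>>>>>>> and the History Server will scan both layouts.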
>>>>>>>> 
>>>>>>>> If you have additional questions or specific edge cases in mind,
>>> I’d
>>>>> be
>>>>>>>> happy to
>>>>>>>> discuss them further!
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Yi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> At 2025-12-24 11:35:00, "Lei Yang" <[email protected]> wrote:
>>>>>>>>> Hi Yi,
>>>>>>>>> 
>>>>>>>>> Thank you for creating this FLIP! The introduction of the
>>> Application
>>>>>>>>> entity significantly enhances the observability and manageability
>>> of
>>>>>>>>> user logic, especially benefiting batch workloads. This is truly
>>>>>>>>> excellent work!
>>>>>>>>> 
>>>>>>>>> However, I have a compatibility concern and would appreciate your
>>>>>>>>> clarification. In the “Archiving Directory Structure” section, I
>>>>> noticed
>>>>>>>>> that the directory structure has been changed. If users have
>>>>> configured
>>>>>>>>> a persistent external path for jobmanager.archive.fs.dir, will
>>> their
>>>>>>>>> existing archives become unreadable after this change? Will the
>>>>>>>>> implementation of this FLIP maintain backward compatibility with
>>>>>>>>> previously archived job data?
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Lei
>>>>>>>>> 
>>>>>>>>> Yi Zhang <[email protected]> 于2025年12月17日周三 14:18写道:
>>>>>>>>> 
>>>>>>>>>> Hi everyone,
>>>>>>>>>> 
>>>>>>>>>> I would like to start a discussion about FLIP-560: Application Capability
>>>>>>>>>> Enhancement [1].
>>>>>>>>>> 
>>>>>>>>>> The primary goal of this FLIP is to improve the usability and availability
>>>>>>>>>> of Flink applications by introducing the following enhancements:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 1. Support multi-job execution in Application Mode, which is an important
>>>>>>>>>> batch-processing use case.
>>>>>>>>>> 2. Support re-running the user's main method after JobManager restarts due
>>>>>>>>>> to failures in Session Mode.
>>>>>>>>>> 3. Expose exceptions thrown in the user's main method via REST/UI.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Looking forward to your feedback and suggestions!
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> [1]
>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Best Regards,
>>>>>>>>>> 
>>>>>>>>>> Yi Zhang
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
