Hi!

Thanks so much for your feedback and suggestions. I have updated the 
FLIP accordingly and truly appreciate your input!


Best,
Yi



At 2026-01-15 14:20:53, [email protected] wrote:
>Hi!
>
>I think a separate endpoint with the effective config makes perfect sense and 
>will cover the requirements.
>
>Thank you for including it :)
>
>Gyula
>
>Sent from my iPhone
>
>> On 15 Jan 2026, at 06:50, Yi Zhang <[email protected]> wrote:
>> 
>> Hi Gyula,
>> 
>> 
>> I have given this some more thought, and I agree with your
>> point: users (and external controllers like the Kubernetes
>> Operator) need access to critical context such as the state
>> recovery path, even in cases where no job has been
>> successfully submitted.
>> 
>> 
>> What if we introduce a new REST API endpoint, such
>> as /applications/:applicationid/jobmanager/config,
>> to expose the effective JobManager configuration used (or
>> intended to be used) by the application? This could include
>> key settings like the state recovery path and other relevant
>> configured options.
>> 
>> 
>> It might help make the API responsibilities clearer, and also
>> provide valuable visibility even when no errors occur. For
>> actual error details, the
>> /applications/:applicationid/exceptions endpoint can be
>> used.
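>>
>>
>> To make this concrete, below is a rough, illustrative sketch of how an external
>> controller (such as the Kubernetes Operator) could query such an endpoint. The
>> REST address and application id are placeholders, and the response shape is only
>> my assumption that the new endpoint would mirror the existing
>> /jobs/:jobid/jobmanager/config handler:
>>
>> import java.net.URI;
>> import java.net.http.HttpClient;
>> import java.net.http.HttpRequest;
>> import java.net.http.HttpResponse;
>>
>> public class EffectiveConfigProbe {
>>     public static void main(String[] args) throws Exception {
>>         // Hypothetical REST address and application id, for illustration only.
>>         String restBase = "http://jobmanager:8081";
>>         String appId = "app-20260115-0001";
>>
>>         HttpRequest request = HttpRequest.newBuilder()
>>                 .uri(URI.create(restBase + "/applications/" + appId
>>                         + "/jobmanager/config"))
>>                 .GET()
>>                 .build();
>>
>>         // If the new endpoint follows the existing config handlers, the body
>>         // would be a JSON array of {"key": ..., "value": ...} pairs, and a
>>         // controller could look up execution.state-recovery.path in it.
>>         HttpResponse<String> response = HttpClient.newHttpClient()
>>                 .send(request, HttpResponse.BodyHandlers.ofString());
>>         System.out.println(response.body());
>>     }
>> }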
>> 
>> 
>> I’d appreciate your thoughts on this approach. Thanks!
>> 
>> 
>> Best,
>> Yi
>> 
>> At 2026-01-12 15:23:58, "Gyula Fóra" <[email protected]> wrote:
>>> Hi!
>>> 
>>> Overall it makes sense. I cannot reproduce a job actually being
>>> submitted/archived properly with an invalid savepoint path (I am testing on
>>> 2.1, as no 2.2/2.3 Docker image is available at the moment). In Kubernetes
>>> this seems to cause an immediate JobManager shutdown.
>>> 
>>> You are right; I am mostly concerned about all the error scenarios that
>>> currently do not result in FAILED job submissions.
>>> 
>>> Regarding what error metadata to expose, I think what you write makes
>>> sense, with the only specific exception that checkpoint/state recovery
>>> information (what checkpoint are we restoring from at the moment) should
>>> always be included in the error/job/app metadata. This is crucial for the
>>> Kubernetes Operator (and possibly other external control planes) to handle
>>> the error. Currently, since this information is sometimes lost, it leads to
>>> many corner cases requiring manual intervention from users.
>>> 
>>> Cheers
>>> Gyula
>>> 
>>>> On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang <[email protected]> wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi Gyula,
>>>> 
>>>> 
>>>> Thank you very much for your explanation.
>>>> 
>>>> 
>>>> "Some errors such as invalid state path are not even submitted or when it
>>>> is Flink
>>>> uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't
>>>> actually include information about the checkpoint restore path/configs
>>>> etc."
>>>> 
>>>> 
>>>> 
>>>> I ran some tests on the latest Flink 2.3 release by submitting a job with
>>>> an invalid
>>>> `execution.state-recovery.path`. The job submission itself succeeded, but
>>>> the job
>>>> failed during initialization. It seems that at least for misconfigured
>>>> state recovery
>>>> paths, the job still goes through submission and gets archived with
>>>> sufficient
>>>> diagnostic info. Am I missing anything here? If there’s a specific Jira
>>>> issue
>>>> describing such a scenario, it would be great to reference it for more
>>>> concrete
>>>> discussion around this requirement.
>>>> 
>>>> 
>>>> That said, I agree that there are different error scenarios we might
>>>> encounter, which
>>>> broadly fall into three categories:
>>>> 
>>>> 1. Failures after successful job submission, which result in a FAILED job
>>>> state. In
>>>> these cases, relevant diagnostics are already accessible via existing
>>>> job-related
>>>> REST endpoints.
>>>> 2. Failures during job submission, leaving no concrete job entity to query.
>>>> 3. Failures in the main() method unrelated to job submission/execution.
>>>> 
>>>> The originally proposed /applications/:applicationid/exceptions endpoint
>>>> is intended
>>>> to expose exceptions from all three categories. From my understanding, your
>>>> primary interest lies in scenario #2, where additional context could help
>>>> diagnose
>>>> why submission failed, even though no real job was created.
>>>> 
>>>> 
>>>> Rather than introducing a general endpoint that exposes all possible
>>>> configuration
>>>> and metadata, would it be more practical to conditionally enrich
>>>> exceptions? For
>>>> example, when a submission fails due to invalid state paths or
>>>> misconfigured
>>>> options, we could attach the relevant configuration settings. This
>>>> approach would
>>>> complement the /applications/:applicationid/exceptions design and allow us
>>>> to
>>>> incrementally evolve toward richer diagnostics over time.
>>>> Having a concrete use case would greatly help align on the scope and
>>>> implementation details of such enrichment.
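>>>>
>>>>
>>>> Purely as an illustration of that idea (the field names below are placeholders,
>>>> not a proposed schema), an enriched exception entry for a failed submission
>>>> could carry something like:
>>>>
>>>>   exception     : <stack trace of the submission failure>
>>>>   failurePhase  : JOB_SUBMISSION          (hypothetical marker for category #2)
>>>>   relatedConfig : execution.state-recovery.path = <the misconfigured path>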
>>>> 
>>>> 
>>>> Thanks again for your valuable feedback and suggestions!
>>>> 
>>>> 
>>>> Best,
>>>> Yi
>>>> 
>>>> P.S. I’ve updated the FLIP to reflect the change regarding using the job name
>>>> for job
>>>> matching. Please let me know if you have any further questions or
>>>> suggestions.
>>>> 
>>>> 
>>>> At 2026-01-08 17:07:16, "Gyula Fóra" <[email protected]> wrote:
>>>>> Hi Yi!
>>>>> 
>>>>> Sorry for the late reply, I somehow missed your response:
>>>>> 
>>>>> "Flink’s existing archive mechanism—combined
>>>>> with the HistoryServer—already provides persistent access to job-related
>>>>> information after failure. Specifically, the existing HistoryServer
>>>> endpoint
>>>>> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
>>>> configuration
>>>>> 
>>>>> including checkpoint restore paths, and remains accessible after failure."
>>>>> 
>>>>> You are right: when a job fails, this is true and we can see the past
>>>>> checkpoint history etc. But I think this doesn't apply to jobs that fail
>>>>> during submission or in the main method. With some errors, such as an
>>>>> invalid state path, the job is not even submitted, or when it is, Flink
>>>>> uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph, which
>>>>> doesn't actually include information about the checkpoint restore
>>>>> path/configs etc.
>>>>> 
>>>>> Cheers
>>>>> Gyula
>>>>> 
>>>>> On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang <[email protected]> wrote:
>>>>> 
>>>>>> Hi Gyula,
>>>>>> 
>>>>>> 
>>>>>> Thank you so much for your thoughtful and insightful feedback!
>>>>>> 
>>>>>> 
>>>>>> 1.  I fully agree that using the job name for job matching is more
>>>>>> user-friendly and
>>>>>> cleaner than relying on a jobIndex parameter. I’ll update the FLIP
>>>>>> accordingly
>>>>>> to reflect this design change.
>>>>>> 
>>>>>> 
>>>>>> 2. I’d like to dig a bit deeper to make sure I fully understand the
>>>>>> requirement.
>>>>>> You have mentioned the need for a generic information endpoint that
>>>>>> remains
>>>>>> accessible even after failure, and that it should include additional
>>>> info
>>>>>> such as
>>>>>> the checkpoint restore path and configuration.
>>>>>> 
>>>>>> From my current understanding, Flink’s existing archive
>>>> mechanism—combined
>>>>>> with the HistoryServer—already provides persistent access to job-related
>>>>>> information after failure. Specifically, the existing HistoryServer
>>>>>> endpoint
>>>>>> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
>>>>>> configuration
>>>>>> including checkpoint restore paths, and remains accessible after
>>>> failure.
>>>>>> On the other hand, the proposed /applications/:appid/exceptions endpoint
>>>>>> is
>>>>>> intended specifically to surface application-level exceptions that occur
>>>>>> outside
>>>>>> the job lifecycle, which will also be available through the
>>>> HistoryServer
>>>>>> after
>>>>>> failure.
>>>>>> 
>>>>>> So could you help clarify whether there is a specific failure scenario
>>>> or
>>>>>> use case
>>>>>> where the current archiving/HistoryServer mechanism falls short or where
>>>>>> critical
>>>>>> debugging information—like the restore path or configuration—is not
>>>>>> retrievable
>>>>>> after a failure?
>>>>>> 
>>>>>> 
>>>>>> Thanks again for your excellent suggestions!
>>>>>> 
>>>>>> Best,
>>>>>> Yi
>>>>>> 
>>>>>> At 2025-12-25 21:08:49, "Gyula Fóra" <[email protected]> wrote:
>>>>>>> Hi!
>>>>>>> 
>>>>>>> Overall I think the design/improvements look great. Some minor comments
>>>>>>> and possible improvements:
>>>>>>> 
>>>>>>> 1. Could we simply use the job name for job matching? I think it's
>>>> fair to
>>>>>>> require unique job names (or, if they are not unique, attach a sequence
>>>>>>> number to the name) instead of the jobIndex parameter. JobIndex sounds
>>>> a
>>>>>>> bit weird and low level.
>>>>>>> 
>>>>>>> 2. A big problem/limitation of the existing submission logic is that the
>>>>>>> submit-on-error logic is very limited (only handling certain types of
>>>>>>> errors and only showing exception info). We should capture different
>>>>>> errors
>>>>>>> and metadata for failed applications including checkpoint settings (for
>>>>>>> instance what checkpoint path was used during restore, which is a
>>>> common
>>>>>>> cause of the errors). So instead of introducing a
>>>>>>> /applications/appid/exceptions endpoint, can we introduce a
>>>> more
>>>>>>> generic information endpoint that would contain other information? This
>>>>>>> endpoint should be accessible even in case of failures and populated
>>>> from
>>>>>>> the app result store and should also contain some other info such as
>>>>>>> checkpoint restore path, configuration etc.
>>>>>>> 
>>>>>>> Capturing more information on failed submissions would help resolve a
>>>> lot
>>>>>>> of long outstanding issues in the Flink Kubernetes Operator as well.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> Gyula
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Dec 25, 2025 at 1:54 PM Lei Yang <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Thank you Yi for your reply, looks good to me!
>>>>>>>> +1 for this proposal
>>>>>>>> Best,
>>>>>>>> Lei
>>>>>>>> 
>>>>>>>> Yi Zhang <[email protected]> wrote on Thu, Dec 25, 2025 at 10:02:
>>>>>>>> 
>>>>>>>>> Hi Lei,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thank you for the feedback!
>>>>>>>>> The "Archiving Directory Structure" section describes a change in
>>>> how
>>>>>>>>> archived
>>>>>>>>> files are organized under jobmanager.archive.fs.dir. While this
>>>> change
>>>>>>>> was
>>>>>>>>> originally proposed in FLIP-549, it's indeed a significant
>>>>>>>>> application-level update,
>>>>>>>>> so I'm glad to have the chance to clarify it here.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> To answer your question directly: backward compatibility is fully
>>>>>>>>> preserved.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> In earlier Flink versions, job archives were written directly under the
>>>>>>>>> configured jobmanager.archive.fs.dir. With this update, Flink will instead
>>>>>>>>> use a hierarchical cluster-application-job structure.
>>>>>>>>> We understand that many users already have archives stored in the legacy
>>>>>>>>> flat layout. To ensure a smooth transition, the History Server will be
>>>>>>>>> updated to read archives from both the old and new directory structures.
>>>>>>>>> As a result, all previously archived jobs will remain accessible and
>>>>>>>>> visible.
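>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Roughly speaking (the exact path segments below are placeholders; the FLIP
>>>>>>>>> has the authoritative layout), the difference looks like this:
>>>>>>>>>
>>>>>>>>>   legacy flat layout : <jobmanager.archive.fs.dir>/<job-id>
>>>>>>>>>   new hierarchy      : <jobmanager.archive.fs.dir>/<cluster-id>/<application-id>/<job-id>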
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> If you have additional questions or specific edge cases in mind,
>>>> I’d
>>>>>> be
>>>>>>>>> happy to
>>>>>>>>> discuss them further!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Yi
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> At 2025-12-24 11:35:00, "Lei Yang" <[email protected]> wrote:
>>>>>>>>>> Hi Yi,
>>>>>>>>>> 
>>>>>>>>>> Thank you for creating this FLIP! The introduction of the
>>>> Application
>>>>>>>>>> entity significantly enhances the observability and manageability
>>>> of
>>>>>>>>>> user logic, especially benefiting batch workloads. This is truly
>>>>>>>>>> excellent work!
>>>>>>>>>> 
>>>>>>>>>> However, I have a compatibility concern and would appreciate your
>>>>>>>>>> clarification. In the “Archiving Directory Structure” section, I
>>>>>> noticed
>>>>>>>>>> that the directory structure has been changed. If users have
>>>>>> configured
>>>>>>>>>> a persistent external path for jobmanager.archive.fs.dir, will
>>>> their
>>>>>>>>>> existing archives become unreadable after this change? Will the
>>>>>>>>>> implementation of this FLIP maintain backward compatibility with
>>>>>>>>>> previously archived job data?
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> Lei
>>>>>>>>>> 
>>>>>>>>>> Yi Zhang <[email protected]> wrote on Wed, Dec 17, 2025 at 14:18:
>>>>>>>>>> 
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>> 
>>>>>>>>>>> I would like to start a discussion about FLIP-560: Application
>>>>>>>>> Capability
>>>>>>>>>>> Enhancement [1].
>>>>>>>>>>> 
>>>>>>>>>>> The primary goal of this FLIP is to improve the usability and
>>>>>>>>>>> availability of Flink applications by introducing the following
>>>>>>>>>>> enhancements:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 1. Support multi-job execution in Application Mode, which is an
>>>>>>>>>>>    important batch-processing use case.
>>>>>>>>>>> 2. Support re-running the user's main method after JobManager restarts
>>>>>>>>>>>    due to failures in Session Mode.
>>>>>>>>>>> 3. Expose exceptions thrown in the user's main method via REST/UI.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Looking forward to your feedback and suggestions!
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> [1]
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> 
>>>>>>>>>>> Yi Zhang
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
