[
https://issues.apache.org/jira/browse/FLINK-34009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805937#comment-17805937
]
Hangxiang Yu commented on FLINK-34009:
--------------------------------------
[[email protected]]
Hi, you could reorganize the information and ask for help on the user mailing
list, where more people can help with this.
Let's report it in Jira once we have confirmed that it is an issue.
> Apache flink: Checkpoint restoration issue on Application Mode of deployment
> ----------------------------------------------------------------------------
>
> Key: FLINK-34009
> URL: https://issues.apache.org/jira/browse/FLINK-34009
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.18.0
> Environment: Flink version: 1.18
> Zookeeper version: 3.7.2
> Env: Custom flink docker image (with embedded application class) deployed
> over kubernetes (v1.26.11).
> Reporter: Vijay
> Priority: Major
>
> Hi Team,
> Good Day. Wish you all a happy new year 2024.
> We are using Flink 1.18 on our cluster. The Job Manager is deployed in
> "Application mode" with HA disabled (high-availability.type: NONE). With this
> configuration we are able to start multiple jobs of a single application
> (using env.executeAsync()).
> Note: We have also set up checkpointing to an S3 instance with
> RETAIN_ON_CANCELLATION mode (plus other required settings).
> Let's say we start two jobs of the same application (e.g. Jobidxxx1,
> jobidxxx2) and they are running in the k8s environment. To perform a minor
> Flink upgrade, or an upgrade of our application with minor changes, we stop
> the Job Manager and Task Manager instances, perform the upgrade, and then
> start the Job Manager and Task Manager instances again. On startup we expect
> the jobs to be restored from the last checkpoint, but no job restoration
> happens when the Job Manager starts. Please let us know whether this is a bug
> or the general behavior of Flink in application mode.
> Additional information: If we enable HA (using Zookeeper) in Application
> mode, we are able to start only one job (i.e., per-job behavior). When we
> perform a minor Flink upgrade, or an upgrade of our application with minor
> changes, checkpoint restoration works properly when the Job Manager and Task
> Managers restart.
> It seems checkpoint restoration and HA are interrelated, but why doesn't
> checkpoint restoration work when HA is disabled?
>
> Please let us know if anyone has experienced similar issues or has any
> suggestions; it would be highly appreciated. Thanks in advance for your
> assistance.
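
A note on the setup quoted above, offered as a sketch rather than a confirmed
diagnosis: with HA disabled, a restarted Job Manager has no stored pointer to
the last completed checkpoint, so a retained checkpoint usually has to be
passed explicitly as the restore path (for example through the
execution.savepoint.path option). Retained checkpoints are normally laid out
as <checkpoint-dir>/<job-id>/chk-<n>/. The helper below only picks the newest
chk-<n> directory from a listing that the caller supplies. The class name, the
method names, the bucket path and the job ID are hypothetical, and the S3
listing itself is left to the caller.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink: selects the newest retained
// checkpoint directory so that its path can be supplied as the restore path.
final class RetainedCheckpoints {

    // Checkpoint directories are named chk-<number>; siblings such as
    // "shared" and "taskowned" must be ignored.
    private static final Pattern CHK_DIR = Pattern.compile("chk-(\\d{1,18})");

    // Returns the directory name with the highest checkpoint number, if any.
    static Optional<String> latest(List<String> dirNames) {
        String best = null;
        long bestId = -1L;
        for (String name : dirNames) {
            Matcher m = CHK_DIR.matcher(name);
            if (m.matches()) {
                long id = Long.parseLong(m.group(1));
                // Compare numerically so that chk-10 sorts after chk-9.
                if (id > bestId) {
                    bestId = id;
                    best = name;
                }
            }
        }
        return Optional.ofNullable(best);
    }

    // Builds <checkpointRoot>/<jobId>/<latest chk dir>, tolerating a trailing slash.
    static Optional<String> restorePath(String checkpointRoot, String jobId,
                                        List<String> dirNames) {
        String root = checkpointRoot.endsWith("/")
                ? checkpointRoot.substring(0, checkpointRoot.length() - 1)
                : checkpointRoot;
        return latest(dirNames).map(dir -> root + "/" + jobId + "/" + dir);
    }

    public static void main(String[] args) {
        // Assumed example listing of one job's checkpoint directory.
        List<String> listing = List.of("shared", "taskowned", "chk-9", "chk-10");
        System.out.println(
                restorePath("s3://my-bucket/checkpoints/", "jobidxxx1", listing)
                        .orElse("no retained checkpoint found"));
    }
}
```

Since the application starts several jobs, each job ID has its own checkpoint
directory, so the lookup would be repeated per job. How a separate restore
path can be applied to each job started with env.executeAsync() is not covered
by this sketch and would need to be checked against the Flink documentation.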
--
This message was sent by Atlassian Jira
(v8.20.10#820010)