[
https://issues.apache.org/jira/browse/FLINK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203246#comment-17203246
]
Till Rohrmann edited comment on FLINK-19154 at 9/28/20, 1:58 PM:
-----------------------------------------------------------------
[~aljoscha], [~kkl0u], there seems to be a problem with the application mode
and its failure handling. I believe that some framework errors are treated as a
proper Flink job failure, which leads to the deletion of HA data even though one
would like to keep this data. Could you take care of this problem?
I think there are two problems here. First of all, not every exception bubbling
up in the future returned by
{{ApplicationDispatcherBootstrap.fixJobIdAndRunApplicationAsync()}} indicates a
job failure. Some of them can also indicate a framework failure, which should
not lead to the cleanup of HA data. The other problem is that the polling
logic cannot properly handle a temporary connection loss to ZooKeeper, which is
a normal situation.
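
To illustrate the first problem, here is a minimal sketch of how the outcome of the application future could be split into "job failed" versus "framework error", so that only the former triggers cleanup. All names in it ({{JobFailedException}}, {{Outcome}}, {{shouldCleanUpHaData}}) are hypothetical and are not Flink classes; it is not the actual fix, only one way the distinction might be drawn.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical sketch: none of these types exist in Flink.
final class FailureClassificationSketch {

    /** Hypothetical marker: the job itself reached a terminal failure. */
    static final class JobFailedException extends RuntimeException {
        JobFailedException(String message) {
            super(message);
        }
    }

    enum Outcome { SUCCEEDED, JOB_FAILED, FRAMEWORK_ERROR }

    /** Strips the CompletionException wrappers that dependent stages add. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /** Only an explicit job failure counts as one; everything else is a framework error. */
    static Outcome classify(Throwable failure) {
        if (failure == null) {
            return Outcome.SUCCEEDED;
        }
        return unwrap(failure) instanceof JobFailedException
                ? Outcome.JOB_FAILED
                : Outcome.FRAMEWORK_ERROR;
    }

    /** Assumption: HA data is only removed once the job is terminal, never on framework errors. */
    static boolean shouldCleanUpHaData(Outcome outcome) {
        return outcome != Outcome.FRAMEWORK_ERROR;
    }

    /** Maps any application future to an outcome instead of letting the exception decide. */
    static <T> CompletableFuture<Outcome> outcomeOf(CompletableFuture<T> applicationFuture) {
        return applicationFuture.handle((ignored, failure) -> classify(failure));
    }
}
```

With such a split, an exception like "job not found because the ZooKeeper connection is suspended" would end up as {{FRAMEWORK_ERROR}} and leave the HA data untouched.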
> Application mode deletes HA data in case of suspended ZooKeeper connection
> --------------------------------------------------------------------------
>
> Key: FLINK-19154
> URL: https://issues.apache.org/jira/browse/FLINK-19154
> Project: Flink
> Issue Type: Bug
> Components: Client / Job Submission
> Affects Versions: 1.12.0, 1.11.1
> Environment: Run a stand-alone cluster that runs a single job (if you
> are familiar with the way Ververica Platform runs Flink jobs, we use a very
> similar approach). It runs Flink 1.11.1 straight from the official docker
> image.
> Reporter: Husky Zeng
> Priority: Blocker
> Fix For: 1.12.0, 1.11.3
>
>
> A user reported that Flink's application mode deletes HA data in case of a
> suspended ZooKeeper connection [1].
> The problem seems to be that the {{ApplicationDispatcherBootstrap}} class
> produces an exception (that the requested job can no longer be found because
> of a lost ZooKeeper connection) which is interpreted as a job failure. Due
> to this interpretation, the cluster is shut down with a terminal state
> of FAILED, which causes the HA data to be cleaned up. The exact problem
> occurs in {{JobStatusPollingUtils.getJobResult}}, which is called by
> {{ApplicationDispatcherBootstrap.getJobResult()}}.
> The behaviour described above can be found in this log [2].
> [1]
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-metadata-deleted-by-Flink-after-ZK-connection-issues-td37937.html
> [2] https://pastebin.com/raw/uH9KDU2L
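
As a rough illustration of polling that survives a temporary connection loss, the sketch below retries a status lookup when it fails with a transient error and only gives up after a configurable number of consecutive failures. It does not reproduce {{JobStatusPollingUtils}}; {{TransientLookupException}}, {{pollUntilPresent}} and its parameters are hypothetical names chosen for this example.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch: not the Flink polling implementation.
final class TolerantPollingSketch {

    /** Hypothetical signal that the lookup failed only because the connection is suspended. */
    static final class TransientLookupException extends RuntimeException {
        TransientLookupException(String message) {
            super(message);
        }
    }

    /**
     * Polls until the lookup yields a value. Transient failures are retried;
     * the exception is rethrown only after more than
     * maxConsecutiveTransientFailures failures in a row.
     */
    static <T> T pollUntilPresent(
            Supplier<Optional<T>> lookup,
            int maxConsecutiveTransientFailures,
            long sleepMillis) {
        int consecutiveFailures = 0;
        while (true) {
            try {
                Optional<T> result = lookup.get();
                consecutiveFailures = 0; // a successful round trip resets the budget
                if (result.isPresent()) {
                    return result.get();
                }
            } catch (TransientLookupException e) {
                consecutiveFailures++;
                if (consecutiveFailures > maxConsecutiveTransientFailures) {
                    throw e;
                }
            }
            sleep(sleepMillis);
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("Interrupted while polling", e);
        }
    }
}
```

Any other exception thrown by the lookup still propagates immediately, so genuine errors are not hidden by the retry loop.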