[
https://issues.apache.org/jira/browse/FLINK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204168#comment-17204168
]
Cristian commented on FLINK-19154:
----------------------------------
Hello guys! Good timing: this happened again yesterday, around the time one of
our ZooKeeper nodes restarted (typical Kubernetes shuffling, so not a ZK
issue). I would be super happy to provide more details.
One interesting but challenging characteristic of this bug is that it affected
only one of the more than 40 jobs we run. The other jobs just restarted, and
their state was preserved.
But for one of the jobs we were unlucky: the job manager wiped its state from
ZK. The logs are pretty much the same as those in this ticket.
> Application mode deletes HA data in case of suspended ZooKeeper connection
> --------------------------------------------------------------------------
>
> Key: FLINK-19154
> URL: https://issues.apache.org/jira/browse/FLINK-19154
> Project: Flink
> Issue Type: Bug
> Components: Client / Job Submission
> Affects Versions: 1.12.0, 1.11.1
> Environment: Run a stand-alone cluster that runs a single job (if you
> are familiar with the way Ververica Platform runs Flink jobs, we use a very
> similar approach). It runs Flink 1.11.1 straight from the official docker
> image.
> Reporter: Husky Zeng
> Priority: Blocker
> Fix For: 1.12.0, 1.11.3
>
>
> A user reported that Flink's application mode deletes HA data in case of a
> suspended ZooKeeper connection [1].
> The problem seems to be that the {{ApplicationDispatcherBootstrap}} class
> produces an exception (that the requested job can no longer be found because of
> a lost ZooKeeper connection) which will be interpreted as a job failure. Due
> to this interpretation, the cluster will be shut down with a terminal state
> of FAILED, which will cause the HA data to be cleaned up. The exact problem
> occurs in {{JobStatusPollingUtils.getJobResult}}, which is called by
> {{ApplicationDispatcherBootstrap.getJobResult()}}.
> The behaviour described above can be found in this log [2].
> [1]
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-metadata-deleted-by-Flink-after-ZK-connection-issues-td37937.html
> [2] https://pastebin.com/raw/uH9KDU2L
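
The failure path described in the issue (a "job not found" error during a
suspended HA connection being treated as a terminal FAILED state, which then
triggers HA cleanup) can be illustrated with a minimal sketch. This is not the
fix that went into Flink, and none of these types are Flink classes:
ShutdownDecision, Outcome, JobNotFoundException, classify, and
shouldCleanUpHaData are hypothetical names, and the haConnectionSuspended flag
is an assumption about what the caller could know.

```java
import java.util.concurrent.CompletionException;

/** Hypothetical sketch; these are NOT Flink classes or Flink's actual logic. */
final class ShutdownDecision {

    /** Possible results of waiting for the job; UNKNOWN means "cannot tell". */
    enum Outcome { SUCCEEDED, FAILED, UNKNOWN }

    /** Stand-in for the "requested job can no longer be found" error. */
    static final class JobNotFoundException extends RuntimeException {
        JobNotFoundException(String message) {
            super(message);
        }
    }

    /** Strips CompletionException wrappers added by CompletableFuture chains. */
    static Throwable unwrap(Throwable error) {
        Throwable current = error;
        while (current instanceof CompletionException && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Classifies the result of polling for the job result.
     * A null error is treated as success (a simplification for this sketch).
     */
    static Outcome classify(Throwable error, boolean haConnectionSuspended) {
        if (error == null) {
            return Outcome.SUCCEEDED;
        }
        Throwable cause = unwrap(error);
        // Assumption: while the HA connection is suspended, "job not found"
        // says nothing about the job itself, so it is not a job failure.
        if (cause instanceof JobNotFoundException && haConnectionSuspended) {
            return Outcome.UNKNOWN;
        }
        return Outcome.FAILED;
    }

    /** Only a terminal outcome should allow HA data to be cleaned up. */
    static boolean shouldCleanUpHaData(Outcome outcome) {
        return outcome == Outcome.SUCCEEDED || outcome == Outcome.FAILED;
    }
}
```

With this split, the situation from the report would be classified as UNKNOWN,
so the cluster could shut down or retry without deleting the checkpoint
metadata, while a genuine job failure would still lead to cleanup.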