[jira] [Commented] (FLINK-19154) Application mode deletes HA data in case of suspended ZooKeeper connection

Cristian (Jira) Tue, 27 Oct 2020 11:35:24 -0700


    [ https://issues.apache.org/jira/browse/FLINK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221652#comment-17221652 ]


Cristian commented on FLINK-19154:
----------------------------------

When are you guys planning to release these changes?

I tried using them (i.e. building a Flink docker image with these changes) but 
hit a wall: these changes are not backwards compatible.

The changes to `flink-clients` (moving the AbstractDispatcherBootstrap 
class) mean that I not only need to upgrade the Flink cluster but also 
recompile all my jobs against these new changes. Since these changes are not 
in Maven yet, I also need to publish the jars to my own repository, etc.

What's more, since this is not backwards compatible, I guess it makes no sense 
to make it part of the 1.11 branch?

> Application mode deletes HA data in case of suspended ZooKeeper connection
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19154
>                 URL: https://issues.apache.org/jira/browse/FLINK-19154
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.12.0, 1.11.1
>         Environment: Run a stand-alone cluster that runs a single job (if you 
> are familiar with the way Ververica Platform runs Flink jobs, we use a very 
> similar approach). It runs Flink 1.11.1 straight from the official docker 
> image.
>            Reporter: Husky Zeng
>            Assignee: Kostas Kloudas
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.12.0, 1.11.3
>
>
> A user reported that Flink's application mode deletes HA data in case of a 
> suspended ZooKeeper connection [1]. 
> The problem seems to be that the {{ApplicationDispatcherBootstrap}} class 
> produces an exception (that the requested job can no longer be found because 
> of a lost ZooKeeper connection) which is interpreted as a job failure. Due 
> to this interpretation, the cluster is shut down with a terminal state 
> of FAILED, which causes the HA data to be cleaned up. The exact problem 
> occurs in {{JobStatusPollingUtils.getJobResult}}, which is called by 
> {{ApplicationDispatcherBootstrap.getJobResult()}}.
> The above-described behaviour can be found in this log [2].
> [1] 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-metadata-deleted-by-Flink-after-ZK-connection-issues-td37937.html
> [2] https://pastebin.com/raw/uH9KDU2L
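
To make the failure mode in the description concrete, here is a minimal, 
self-contained Java sketch. It is not Flink code and not the actual fix: the 
class, enum, and exception names below are hypothetical and only model the 
idea that a failed job-result lookup (e.g. because of a suspended ZooKeeper 
connection) should not be treated the same way as a job that actually failed. 
The rule that any terminal status other than UNKNOWN triggers HA cleanup is 
also an assumption made for this sketch.

```java
// Hypothetical model of the shutdown decision; none of these types are Flink classes.
class ShutdownDecisionSketch {

    /** Cluster status in this sketch. UNKNOWN means "outcome not known, keep HA data". */
    enum ClusterStatus { SUCCEEDED, FAILED, UNKNOWN }

    /** Hypothetical: the job really ran and failed. */
    static class JobFailedException extends RuntimeException {
        JobFailedException(String message) { super(message); }
    }

    /** Hypothetical: the job result could not be looked up, e.g. lost ZooKeeper connection. */
    static class JobLookupException extends RuntimeException {
        JobLookupException(String message) { super(message); }
    }

    /** Behaviour described in the issue: every exception is read as a job failure. */
    static ClusterStatus naiveStatus(Throwable error) {
        return error == null ? ClusterStatus.SUCCEEDED : ClusterStatus.FAILED;
    }

    /** Alternative: only a genuine job failure maps to FAILED; anything else stays UNKNOWN. */
    static ClusterStatus cautiousStatus(Throwable error) {
        if (error == null) {
            return ClusterStatus.SUCCEEDED;
        }
        if (error instanceof JobFailedException) {
            return ClusterStatus.FAILED;
        }
        return ClusterStatus.UNKNOWN;
    }

    /** Assumption of this sketch: HA data is removed only once the outcome is known. */
    static boolean shouldCleanUpHaData(ClusterStatus status) {
        return status != ClusterStatus.UNKNOWN;
    }

    public static void main(String[] args) {
        Throwable lookupError = new JobLookupException("job not found: ZooKeeper connection suspended");
        System.out.println("naive cleanup:    " + shouldCleanUpHaData(naiveStatus(lookupError)));
        System.out.println("cautious cleanup: " + shouldCleanUpHaData(cautiousStatus(lookupError)));
    }
}
```

With the naive mapping, the lookup error ends in FAILED and the HA data would 
be cleaned up. With the cautious mapping, the same error leaves the status 
UNKNOWN and the data is kept.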



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
