[ 
https://issues.apache.org/jira/browse/FLINK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zili Chen closed FLINK-14685.
-----------------------------
    Fix Version/s:     (was: 1.10.1)
                       (was: 1.11.0)
       Resolution: Duplicate

> ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-14685
>                 URL: https://issues.apache.org/jira/browse/FLINK-14685
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zili Chen
>            Priority: Major
>
> Currently, if {{ZooKeeperCheckpointIDCounter}} enters the SUSPENDED state, i.e. 
> connection loss, it marks its state as invalid so that all subsequent 
> checkpoint id counter operations fail.
> Coupled with JM leadership management, we generate a new id counter when 
> leadership is re-granted, so this has not been a problem so far. The semantic 
> is still wrong, because the id counter should only check whether the current 
> state is SUSPENDED/LOST. 
> It is also a blocker for upgrading to Curator 4.2 and tolerating the SUSPENDED 
> state in {{LeaderLatch}}. [~lamber-ken] provides a 
> [fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299]
>  there.
> Besides, in a production scenario we once noticed that the JM was not 
> re-elected (this shouldn't happen after [~trohrmann] added the linearized 
> leader operation) when a SUSPENDED-RECONNECTED transition happened very fast, 
> so that the JM kept running with a broken ID counter.
> I think it is reasonable to pick [~lamber-ken]'s commit as a separate issue 
> and fix this wrong semantic.
> CC [~GJL] [~azagrebin]
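
The difference between the two semantics described above can be sketched roughly as follows. This is a minimal illustration only, not Flink's or Curator's code: `ConnState` is a hypothetical stand-in for Curator's connection state enum, both class names are invented for the example, and an in-memory `AtomicLong` replaces the ZooKeeper-backed shared count. It also assumes that LOST is treated the same way as SUSPENDED.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for Curator's connection states; not the real enum.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Behaviour described in the report: a single connection loss
// invalidates the counter for the rest of its lifetime.
class StickyInvalidCounter {
    private final AtomicLong next = new AtomicLong(); // in-memory stand-in for the ZK count
    private volatile boolean invalid = false;

    void stateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            invalid = true; // never reset, even after RECONNECTED
        }
    }

    long getAndIncrement() {
        if (invalid) {
            throw new IllegalStateException("counter invalidated by an earlier connection loss");
        }
        return next.getAndIncrement();
    }
}

// Suggested semantic: only the current connection state matters.
class StateCheckingCounter {
    private final AtomicLong next = new AtomicLong(); // in-memory stand-in for the ZK count
    private volatile ConnState lastState = ConnState.CONNECTED;

    void stateChanged(ConnState newState) {
        lastState = newState; // remember the latest state, nothing sticky
    }

    long getAndIncrement() {
        ConnState current = lastState;
        if (current == ConnState.SUSPENDED || current == ConnState.LOST) {
            throw new IllegalStateException("connection is currently " + current);
        }
        return next.getAndIncrement();
    }
}
```

With this sketch, a SUSPENDED event followed by RECONNECTED leaves `StickyInvalidCounter` permanently failing, while `StateCheckingCounter` rejects operations only while the connection is actually down and resumes afterwards.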



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
