Zili Chen created FLINK-14685:
---------------------------------

             Summary: ZooKeeperCheckpointIDCounter forever broken if once loss 
connection with ZK
                 Key: FLINK-14685
                 URL: https://issues.apache.org/jira/browse/FLINK-14685
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing, Runtime / Coordination
    Affects Versions: 1.10.0
            Reporter: Zili Chen
             Fix For: 1.10.0


Currently, if {{ZooKeeperCheckpointIDCounter}} sees a SUSPENDED state, i.e. a 
connection loss, it marks its state as invalid, so that all subsequent 
checkpoint ID counter operations fail.

Coupled with JM leadership management, we generate a new ID counter when 
leadership is re-granted, so this has not been a problem so far. The semantics 
are still wrong, because the ID counter should only check whether the current 
state is SUSPENDED/LOST. 
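
The difference between the two semantics can be sketched as follows. This is 
only an illustration, not the actual Flink code: {{ConnState}}, 
{{StickyInvalidCounter}} and {{CurrentStateCounter}} are hypothetical names, 
{{ConnState}} is a stand-in for Curator's connection state, and an in-memory 
{{AtomicLong}} replaces the ZooKeeper-backed count.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for Curator's connection states (assumption, not the real enum).
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Behavior described in this report: a single connection loss
// invalidates the counter for the rest of its lifetime.
class StickyInvalidCounter {
    private final AtomicLong next = new AtomicLong(1); // stand-in for the ZK-backed count
    private volatile boolean invalid = false;

    void stateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            invalid = true; // never reset, even after RECONNECTED
        }
    }

    long getAndIncrement() {
        if (invalid) {
            throw new IllegalStateException("counter was invalidated by an earlier connection loss");
        }
        return next.getAndIncrement();
    }
}

// Semantics proposed in this report: only the current connection state matters.
class CurrentStateCounter {
    private final AtomicLong next = new AtomicLong(1); // stand-in for the ZK-backed count
    private volatile ConnState lastState = ConnState.CONNECTED;

    void stateChanged(ConnState newState) {
        lastState = newState; // remember the latest state only
    }

    long getAndIncrement() {
        ConnState state = lastState;
        if (state == ConnState.SUSPENDED || state == ConnState.LOST) {
            throw new IllegalStateException("connection is currently " + state);
        }
        return next.getAndIncrement();
    }
}
```

With the first variant, a SUSPENDED event followed by RECONNECTED still leaves 
every later {{getAndIncrement()}} call failing. With the second, calls fail 
only while the connection is SUSPENDED or LOST and work again after 
RECONNECTED.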

It is also a blocker for upgrading to Curator 4.2, and [~lamber-ken] provides a 
[fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299]
 there.

Besides, in a production scenario we once noticed that the JM was not 
re-elected when SUSPENDED-RECONNECTED happened very quickly (this shouldn't 
happen since [~trohrmann] added the linearized leader operations), so the JM 
kept running with a broken ID counter.

I think it is reasonable to pick [~lamber-ken]'s commit as a separate issue 
and fix these wrong semantics.

CC [~GJL] [~azagrebin]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
