[
https://issues.apache.org/jira/browse/FLINK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zili Chen closed FLINK-14685.
-----------------------------
Fix Version/s: (was: 1.10.1)
(was: 1.11.0)
Resolution: Duplicate
> ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK
> ---------------------------------------------------------------------------
>
> Key: FLINK-14685
> URL: https://issues.apache.org/jira/browse/FLINK-14685
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: Zili Chen
> Priority: Major
>
> Currently, if {{ZooKeeperCheckpointIDCounter}} sees the SUSPENDED state, i.e.
> a connection loss, it marks its state as invalid, so that all subsequent
> checkpoint ID counter operations fail.
> Because this is coupled with JM leadership management, and we create a new ID
> counter when leadership is re-granted, it has not been a problem so far. The
> semantics are still wrong, though: the ID counter should only check whether
> the current state is SUSPENDED/LOST.
> It is also a blocker for upgrading to Curator 4.2 and tolerating the SUSPENDED
> state in {{LeaderLatch}}. [~lamber-ken] provides a
> [fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299]
> there.
> Besides, in a production scenario we once noticed that the JM was not
> re-elected on a very fast SUSPENDED-RECONNECTED transition (this shouldn't
> happen after [~trohrmann] added the linearized leader operation), so the JM
> kept running with a broken ID counter.
> I think it is reasonable to pick up [~lamber-ken]'s commit as a separate issue
> and fix these wrong semantics.
> CC [~GJL] [~azagrebin]
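
The semantic difference described above can be illustrated with a minimal Java sketch. It uses neither Flink nor Curator code, and it is not the linked fix: `CounterSketch`, `LatchingCounter`, `StateCheckingCounter` and `ConnState` are hypothetical names, an in-memory `AtomicLong` stands in for the ZooKeeper-backed count, and the enum only mirrors a few of Curator's `ConnectionState` values.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CounterSketch {

    /** Hypothetical stand-in for a subset of Curator's ConnectionState values. */
    public enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Behaviour reported in the issue: one connection loss breaks the counter for good. */
    public static class LatchingCounter {
        private final AtomicLong next = new AtomicLong(); // assumption: replaces the ZK shared count
        private volatile boolean invalid;

        public void stateChanged(ConnState newState) {
            if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
                invalid = true; // latched; RECONNECTED never clears it
            }
        }

        public long getAndIncrement() {
            if (invalid) {
                throw new IllegalStateException("connection was lost at some point");
            }
            return next.getAndIncrement();
        }
    }

    /** Proposed semantics: only the current connection state matters. */
    public static class StateCheckingCounter {
        private final AtomicLong next = new AtomicLong(); // assumption: replaces the ZK shared count
        private volatile ConnState lastState = ConnState.CONNECTED;

        public void stateChanged(ConnState newState) {
            lastState = newState;
        }

        public long getAndIncrement() {
            ConnState current = lastState;
            if (current == ConnState.SUSPENDED || current == ConnState.LOST) {
                throw new IllegalStateException("connection is currently " + current);
            }
            return next.getAndIncrement();
        }
    }

    public static void main(String[] args) {
        LatchingCounter latching = new LatchingCounter();
        StateCheckingCounter checking = new StateCheckingCounter();

        // Simulate a short connection loss followed by a reconnect.
        for (ConnState s : new ConnState[] {ConnState.SUSPENDED, ConnState.RECONNECTED}) {
            latching.stateChanged(s);
            checking.stateChanged(s);
        }

        try {
            latching.getAndIncrement();
            System.out.println("latching counter: usable after reconnect");
        } catch (IllegalStateException e) {
            System.out.println("latching counter: still broken after reconnect");
        }
        System.out.println("state-checking counter: next id = " + checking.getAndIncrement());
    }
}
```

In this sketch both counters reject operations while the connection is SUSPENDED or LOST; they differ only in whether the counter becomes usable again once the connection is back.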