[
https://issues.apache.org/jira/browse/FLINK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yu Li updated FLINK-14091:
--------------------------
Fix Version/s: 1.10.0
This seems to be something we should try to fix in 1.10.0.
> Job can not trigger checkpoint forever after zookeeper change leader
> ---------------------------------------------------------------------
>
> Key: FLINK-14091
> URL: https://issues.apache.org/jira/browse/FLINK-14091
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.9.0
> Reporter: Peng Wang
> Assignee: Zili Chen
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.10.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When ZooKeeper changes leader, the Curator connection state becomes SUSPENDED and
> the job manager cannot trigger checkpoints. However, it still does not trigger
> checkpoints after the ZooKeeper connection resumes.
> We found that lastState in the class ZooKeeperCheckpointIDCounter never
> changes back to normal once it has fallen into SUSPENDED or LOST.
> /**
>  * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
>  * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
>  */
> private static class SharedCountConnectionStateListener implements ConnectionStateListener {
>
>     private volatile ConnectionState lastState;
>
>     @Override
>     public void stateChanged(CuratorFramework client, ConnectionState newState) {
>         if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
>             lastState = newState;
>         }
>     }
>
>     private ConnectionState getLastState() {
>         return lastState;
>     }
> }
>
> We changed the listener to reset the state once the connection recovers. In our tests, this solved the problem.
>
> /**
>  * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
>  * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
>  */
> private static class SharedCountConnectionStateListener implements ConnectionStateListener {
>
>     private volatile ConnectionState lastState;
>
>     @Override
>     public void stateChanged(CuratorFramework client, ConnectionState newState) {
>         if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
>             lastState = newState;
>         } else {
>             /* If the new state is neither SUSPENDED nor LOST, reset lastState. */
>             lastState = null;
>         }
>     }
>
>     private ConnectionState getLastState() {
>         return lastState;
>     }
> }
>
> log:
> 2019-09-16 13:38:38,020 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-09-16 13:38:38,122 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> 2019-09-16 13:38:38,123 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,126 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,126 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/dispatcher no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/resourcemanager no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://node007224:8081 no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/jobmanager_2 no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:39,109 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2019-09-16 13:38:39,109 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 192.168.7.231/192.168.7.231:2181
> 2019-09-16 13:38:39,109 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> 2019-09-16 13:38:39,110 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 192.168.7.231/192.168.7.231:2181, initiating session
> 2019-09-16 13:38:39,112 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-09-16 13:38:39,778 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2019-09-16 13:38:39,778 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 192.168.7.230/192.168.7.230:2181
> 2019-09-16 13:38:39,778 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> 2019-09-16 13:38:39,778 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 192.168.7.230/192.168.7.230:2181, initiating session
> 2019-09-16 13:38:39,780 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 192.168.7.230/192.168.7.230:2181, sessionid = 0x26cff6487c2000e, negotiated timeout = 60000
> 2019-09-16 13:38:39,780 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:43,142 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 6995 for job 21b6ef566750f5766443641254e8e1a9 (16841 bytes in 49 ms).
> 2019-09-16 13:38:43,144 ERROR org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Exception while triggering checkpoint for job 21b6ef566750f5766443641254e8e1a9.
> java.lang.IllegalStateException: Connection state: SUSPENDED
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.checkConnectionState(ZooKeeperCheckpointIDCounter.java:159)
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.get(ZooKeeperCheckpointIDCounter.java:133)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:448)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:1323)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
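
The behaviour reported in the issue can be reproduced without a ZooKeeper cluster. The following is a minimal sketch, not the Flink source. The nested ConnectionState enum is a stand-in that only borrows the constant names of Curator's enum, and the class ConnectionStateSketch, the resetOnRecovery flag and the helper canTriggerAfterReconnect are hypothetical names introduced for illustration. The guard in checkConnectionState is an assumption inferred from the "Connection state: SUSPENDED" exception in the log. The sketch replays the SUSPENDED then RECONNECTED sequence from the log against both listener variants.

```java
class ConnectionStateSketch {

    /** Stand-in for Curator's ConnectionState (assumption: only the constant names are borrowed). */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST, READ_ONLY }

    /** Listener modelled on the one quoted in the issue; resetOnRecovery is a hypothetical switch. */
    static class Listener {
        private final boolean resetOnRecovery;
        private volatile ConnectionState lastState;

        Listener(boolean resetOnRecovery) {
            this.resetOnRecovery = resetOnRecovery;
        }

        void stateChanged(ConnectionState newState) {
            if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
                lastState = newState;
            } else if (resetOnRecovery) {
                // Proposed change from the issue: forget the bad state once the connection recovers.
                lastState = null;
            }
        }

        ConnectionState getLastState() {
            return lastState;
        }
    }

    /** Assumed guard: refuse to proceed while a SUSPENDED or LOST state is still recorded. */
    static void checkConnectionState(Listener listener) {
        ConnectionState last = listener.getLastState();
        if (last != null) {
            throw new IllegalStateException("Connection state: " + last);
        }
    }

    /** Replays SUSPENDED followed by RECONNECTED and reports whether the guard passes afterwards. */
    static boolean canTriggerAfterReconnect(boolean resetOnRecovery) {
        Listener listener = new Listener(resetOnRecovery);
        listener.stateChanged(ConnectionState.SUSPENDED);
        listener.stateChanged(ConnectionState.RECONNECTED);
        try {
            checkConnectionState(listener);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("original listener: " + canTriggerAfterReconnect(false)); // prints false
        System.out.println("patched listener:  " + canTriggerAfterReconnect(true));  // prints true
    }
}
```

With the original listener the recorded SUSPENDED state survives the reconnect, so every later check fails, which matches the repeated trigger failure in the log. With the reset in place the check passes again after RECONNECTED.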