Peng Wang created FLINK-14091:
---------------------------------
Summary: Job can not trigger checkpoint forever after zookeeper
change leader
Key: FLINK-14091
URL: https://issues.apache.org/jira/browse/FLINK-14091
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.9.0
Reporter: Peng Wang
When the ZooKeeper ensemble changes its leader, the Curator connection state
becomes SUSPENDED and the JobManager cannot trigger checkpoints. However, it
still does not trigger checkpoints after the ZooKeeper connection resumes.
We found that lastState in the class ZooKeeperCheckpointIDCounter never
changes back to normal once it has fallen into SUSPENDED or LOST.

/**
 * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
 * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
 */
private static class SharedCountConnectionStateListener implements ConnectionStateListener {

    private volatile ConnectionState lastState;

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
            lastState = newState;
        }
    }

    private ConnectionState getLastState() {
        return lastState;
    }
}

We changed the listener to reset the state when the connection recovers. In
our tests, this solved the problem.

/**
 * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
 * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
 */
private static class SharedCountConnectionStateListener implements ConnectionStateListener {

    private volatile ConnectionState lastState;

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
            lastState = newState;
        } else {
            /* If the connection state is neither SUSPENDED nor LOST, reset lastState. */
            lastState = null;
        }
    }

    private ConnectionState getLastState() {
        return lastState;
    }
}

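
To make the difference between the two listeners concrete, here is a minimal stand-alone sketch that replays the state sequence seen in the log (SUSPENDED, then RECONNECTED) against both variants. It does not use Curator or Flink classes: the State enum, the Listener class, the resetOnRecovery flag and canTriggerAfter are hypothetical names introduced only for illustration, and checkConnectionState is merely modeled on the exception message in the stack trace, not copied from the Flink source.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-alone model of the listener logic; it does not use Curator or Flink classes. */
public class ConnectionStateReplay {

    /** Hypothetical stand-in for Curator's ConnectionState enum. */
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** resetOnRecovery=false models the reported behavior, true models the proposed change. */
    static final class Listener {
        private final boolean resetOnRecovery;
        private volatile State lastState;

        Listener(boolean resetOnRecovery) {
            this.resetOnRecovery = resetOnRecovery;
        }

        void stateChanged(State newState) {
            if (newState == State.SUSPENDED || newState == State.LOST) {
                lastState = newState;
            } else if (resetOnRecovery) {
                // Proposed change: forget the bad state once the connection is healthy again.
                lastState = null;
            }
        }

        State getLastState() {
            return lastState;
        }
    }

    /** Assumed guard: any remembered bad state blocks reading the counter. */
    static void checkConnectionState(Listener listener) {
        State last = listener.getLastState();
        if (last != null) {
            throw new IllegalStateException("Connection state: " + last);
        }
    }

    /** Replays the events and reports whether the guard passes afterwards. */
    static boolean canTriggerAfter(boolean resetOnRecovery, List<State> events) {
        Listener listener = new Listener(resetOnRecovery);
        for (State s : events) {
            listener.stateChanged(s);
        }
        try {
            checkConnectionState(listener);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<State> fromLog = Arrays.asList(State.SUSPENDED, State.RECONNECTED);
        System.out.println("original listener: " + canTriggerAfter(false, fromLog)); // false
        System.out.println("patched listener:  " + canTriggerAfter(true, fromLog));  // true
    }
}
```

Under these assumptions the original listener stays blocked after RECONNECTED because nothing ever clears lastState, while the patched one passes the guard again.
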
Log:

2019-09-16 13:38:38,020 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-16 13:38:38,122 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2019-09-16 13:38:38,123 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-09-16 13:38:38,126 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-09-16 13:38:38,126 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/dispatcher no longer participates in the leader election.
2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/resourcemanager no longer participates in the leader election.
2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://node007224:8081 no longer participates in the leader election.
2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/jobmanager_2 no longer participates in the leader election.
2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-09-16 13:38:39,109 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-09-16 13:38:39,109 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 192.168.7.231/192.168.7.231:2181
2019-09-16 13:38:39,109 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2019-09-16 13:38:39,110 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 192.168.7.231/192.168.7.231:2181, initiating session
2019-09-16 13:38:39,112 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
2019-09-16 13:38:39,778 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-09-16 13:38:39,778 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 192.168.7.230/192.168.7.230:2181
2019-09-16 13:38:39,778 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2019-09-16 13:38:39,778 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 192.168.7.230/192.168.7.230:2181, initiating session
2019-09-16 13:38:39,780 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 192.168.7.230/192.168.7.230:2181, sessionid = 0x26cff6487c2000e, negotiated timeout = 60000
2019-09-16 13:38:39,780 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-09-16 13:38:43,142 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 6995 for job 21b6ef566750f5766443641254e8e1a9 (16841 bytes in 49 ms).
2019-09-16 13:38:43,144 ERROR org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Exception while triggering checkpoint for job 21b6ef566750f5766443641254e8e1a9.
java.lang.IllegalStateException: Connection state: SUSPENDED
    at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.checkConnectionState(ZooKeeperCheckpointIDCounter.java:159)
    at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.get(ZooKeeperCheckpointIDCounter.java:133)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:448)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:1323)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

--
This message was sent by Atlassian Jira
(v8.3.2#803003)