[ 
https://issues.apache.org/jira/browse/FLINK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-14091.
-----------------------------------
    Resolution: Fixed

Fixed via

master:
4b98956f968cb7abf83673e570262f439ca99fe9
25b169744d348afa9d7deac98fa7ab3592343b32
7a0fa1e09979f91f6787c63db2af8143faa8e973
7455a0946ef80ea45f0e79116f99c2812cb6aa5f

1.10.0:
7181254cb45be275039b47db14ac8ff1c030577e
7f27bb6ae139e8628230e5caaaf7b2550c2d4490
7fcda36fe58a28891c4104ec9926b1bf281e7c49
4b78a4e41a138820e0a07ccdc056729180aa7dd6

> Job can not trigger checkpoint forever after zookeeper change leader 
> ---------------------------------------------------------------------
>
>                 Key: FLINK-14091
>                 URL: https://issues.apache.org/jira/browse/FLINK-14091
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0
>            Reporter: Peng Wang
>            Assignee: Zili Chen
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When the ZooKeeper leader changes, the Curator connection state becomes
> SUSPENDED and the job manager cannot trigger checkpoints. However, it still
> does not trigger checkpoints after the ZooKeeper connection resumes.
> We found that lastState in the class ZooKeeperCheckpointIDCounter never
> changes back to normal once it has fallen into SUSPENDED or LOST.
> /**
>  * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
>  * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
>  */
> private static class SharedCountConnectionStateListener implements ConnectionStateListener {
>
>     private volatile ConnectionState lastState;
>
>     @Override
>     public void stateChanged(CuratorFramework client, ConnectionState newState) {
>         if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
>             lastState = newState;
>         }
>     }
>
>     private ConnectionState getLastState() {
>         return lastState;
>     }
> }
>  
> We changed the listener to reset lastState on any other connection state
> (see the else branch). In our testing, this solved the problem.
>  
> /**
>  * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
>  * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
>  */
> private static class SharedCountConnectionStateListener implements ConnectionStateListener {
>
>     private volatile ConnectionState lastState;
>
>     @Override
>     public void stateChanged(CuratorFramework client, ConnectionState newState) {
>         if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
>             lastState = newState;
>         } else {
>             /* if connectionState is not SUSPENDED and LOST, reset lastState. */
>             lastState = null;
>         }
>     }
>
>     private ConnectionState getLastState() {
>         return lastState;
>     }
> }
>  
> log:
> 2019-09-16 13:38:38,020 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-09-16 13:38:38,122 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED
> 2019-09-16 13:38:38,123 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,126 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,126 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
> 2019-09-16 13:38:38,128 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/dispatcher no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/resourcemanager no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender http://node007224:8081 no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/jobmanager_2 no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:39,109 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2019-09-16 13:38:39,109 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 192.168.7.231/192.168.7.231:2181
> 2019-09-16 13:38:39,109 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
> 2019-09-16 13:38:39,110 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 192.168.7.231/192.168.7.231:2181, initiating session
> 2019-09-16 13:38:39,112 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-09-16 13:38:39,778 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2019-09-16 13:38:39,778 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 192.168.7.230/192.168.7.230:2181
> 2019-09-16 13:38:39,778 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
> 2019-09-16 13:38:39,778 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 192.168.7.230/192.168.7.230:2181, initiating session
> 2019-09-16 13:38:39,780 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 192.168.7.230/192.168.7.230:2181, sessionid = 0x26cff6487c2000e, negotiated timeout = 60000
> 2019-09-16 13:38:39,780 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: RECONNECTED
> 2019-09-16 13:38:39,780 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,780 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,780 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,780 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
> 2019-09-16 13:38:39,780 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:43,142 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 6995 for job 21b6ef566750f5766443641254e8e1a9 (16841 bytes in 49 ms).
> 2019-09-16 13:38:43,144 ERROR org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Exception while triggering checkpoint for job 21b6ef566750f5766443641254e8e1a9.
> java.lang.IllegalStateException: Connection state: SUSPENDED
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.checkConnectionState(ZooKeeperCheckpointIDCounter.java:159)
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.get(ZooKeeperCheckpointIDCounter.java:133)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:448)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:1323)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
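
The behaviour reported in the description can be reproduced in isolation. The sketch below is only an illustration and is not Flink code: the State enum is a hypothetical stand-in for Curator's ConnectionState so that it runs without Curator on the classpath, the class names StickyListener and ResettingListener are made up here, and checkConnectionState is an assumed guard inferred from the exception message in the log. ResettingListener mirrors the change proposed in the report, which is not necessarily identical to the commits listed at the top of this message.

```java
public class ListenerResetSketch {

    // Hypothetical stand-in for Curator's ConnectionState enum.
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Listener as quoted in the report: only SUSPENDED and LOST are recorded,
    // so a later RECONNECTED never clears the recorded state.
    static class StickyListener {
        private volatile State lastState;

        void stateChanged(State newState) {
            if (newState == State.SUSPENDED || newState == State.LOST) {
                lastState = newState;
            }
        }

        State getLastState() {
            return lastState;
        }
    }

    // Listener with the change proposed in the report: any other state resets lastState.
    static class ResettingListener {
        private volatile State lastState;

        void stateChanged(State newState) {
            if (newState == State.SUSPENDED || newState == State.LOST) {
                lastState = newState;
            } else {
                lastState = null;
            }
        }

        State getLastState() {
            return lastState;
        }
    }

    // Assumed guard, inferred from "IllegalStateException: Connection state: SUSPENDED"
    // in the log: refuse to read the counter while a bad state is recorded.
    static void checkConnectionState(State lastState) {
        if (lastState != null) {
            throw new IllegalStateException("Connection state: " + lastState);
        }
    }

    // Returns true if a checkpoint ID could be read, false if the guard rejects the call.
    static boolean canReadCounter(State lastState) {
        try {
            checkConnectionState(lastState);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        StickyListener sticky = new StickyListener();
        ResettingListener resetting = new ResettingListener();

        // Same sequence as in the log: SUSPENDED followed by RECONNECTED.
        for (State s : new State[] {State.SUSPENDED, State.RECONNECTED}) {
            sticky.stateChanged(s);
            resetting.stateChanged(s);
        }

        System.out.println("sticky listener can read counter: "
                + canReadCounter(sticky.getLastState()));
        System.out.println("resetting listener can read counter: "
                + canReadCounter(resetting.getLastState()));
    }
}
```

With the SUSPENDED then RECONNECTED sequence from the log, the first listener keeps reporting SUSPENDED and the guard keeps rejecting reads, while the second listener reports null again after the reconnect.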



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
