[
https://issues.apache.org/jira/browse/ZOOKEEPER-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944719#comment-13944719
]
Michi Mutsuzaki commented on ZOOKEEPER-1900:
--------------------------------------------
In that case, it seems like FileTxnSnapLog.truncateLog() should throw an
exception to make Learner.syncWithLeader fail. Otherwise the follower ends up
missing some transactions.
Why would the log files be missing though? The follower must have told the
leader that it has lastzxid that's higher than the zxid in the TRUNC message. I
guess somebody outside of ZooKeeper could delete the log files...
> NullPointerException in truncate
> ---------------------------------
>
> Key: ZOOKEEPER-1900
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1900
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.5, 3.4.6
> Environment: linux java 1.6
> Reporter: Steven Bower
> Priority: Blocker
> Fix For: 3.4.7, 3.5.0
>
>
> The other day we started up a ZK instance that had been down for a bit (1day)
> and started getting NPEs all over the place...
> {noformat}
> 2014-20-03 11:15:42.320 INFO QuorumPeerConfig [main] - Reading configuration
> from: /xxx/bin/zk/etc/zk.cfg
> 2014-20-03 11:15:42.350 INFO QuorumPeerConfig [main] - Defaulting to
> majority quorums
> 2014-20-03 11:15:42.353 INFO DatadirCleanupManager [main] -
> autopurge.snapRetainCount set to 3
> 2014-20-03 11:15:42.353 INFO DatadirCleanupManager [main] -
> autopurge.purgeInterval set to 0
> 2014-20-03 11:15:42.353 INFO DatadirCleanupManager [main] - Purge task is
> not scheduled.
> 2014-20-03 11:15:42.385 INFO QuorumPeerMain [main] - Starting quorum peer
> 2014-20-03 11:15:42.399 INFO NIOServerCnxnFactory [main] - binding to port
> 0.0.0.0/0.0.0.0:5555
> 2014-20-03 11:15:42.413 INFO QuorumPeer [main] - tickTime set to 2000
> 2014-20-03 11:15:42.413 INFO QuorumPeer [main] - minSessionTimeout set to -1
> 2014-20-03 11:15:42.413 INFO QuorumPeer [main] - maxSessionTimeout set to -1
> 2014-20-03 11:15:42.413 INFO QuorumPeer [main] - initLimit set to 10
> 2014-20-03 11:15:42.456 INFO FileSnap [main] - Reading snapshot
> /xxx/zk_data/version-2/snapshot.2c00000000
> 2014-20-03 11:15:42.463 INFO QuorumCnxManager [Thread-3] - My election bind
> port: 0.0.0.0/0.0.0.0:7555
> 2014-20-03 11:15:42.470 INFO QuorumPeer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 11:15:42.471 INFO FastLeaderElection
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id = 3,
> proposed zxid=0x8000000000000000
> 2014-20-03 11:15:42.479 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING
> (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.479 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING
> (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO QuorumCnxManager [WorkerSender[myid=3]] - Have
> smaller server identifier, so dropping the connection: (5, 3)
> 2014-20-03 11:15:42.482 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING
> (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING
> (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO QuorumPeer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 11:15:42.486 INFO Learner
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - TCP NoDelay set to: true
> 2014-20-03 11:15:42.488 INFO QuorumCnxManager [host1/###.###.###.###:7555] -
> Received connection request /###.###.###.###:64528
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:host.name=host1
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:java.version=1.6.0_20
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:java.vendor=Sun Microsystems Inc.
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:java.home=/xxx/util/common/jdk1.6.0_20_64bit/jre
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:java.class.path=/xxx/bin/zk/etc:/xxx/bin/zk/lib/slf4j-log4j12-1.7.2.jar:/xxx/bin/zk/lib/jline-0.9.94.jar:/xxx/bin/zk/lib/jul-to-slf4j-1.7.2.jar:/xxx/bin/zk/lib/ZooInspector-3.4.5.jar:/xxx/bin/zk/lib/jcl-over-slf4j-1.7.2.jar:/xxx/bin/zk/lib/log4j-1.2.17.jar:/xxx/bin/zk/lib/zookeeper-3.4.5.jar:/xxx/bin/zk/lib/slf4j-api-1.7.2.jar:/xxx/bin/zk/lib/netty-3.2.2.Final.jar
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:java.library.path=/xxx/util/common/jdk1.6.0_20_64bit/jre/lib/amd64/server:/xxx/util/common/jdk1.6.0_20_64bit/jre/lib/amd64:/xxx/util/common/jdk1.6.0_20_64bit/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:java.io.tmpdir=/tmp
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:java.compiler=<NA>
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:os.name=Linux
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:os.arch=amd64
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:os.version=2.6.32-220.2.1.el6.x86_64
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:user.name=op
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:user.home=/xxx/bin
> 2014-20-03 11:15:42.490 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server
> environment:user.dir=/xxx/bin
> 2014-20-03 11:15:42.491 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2
> snapdir /xxx/zk_data/version-2
> 2014-20-03 11:15:42.493 INFO Learner
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing
> host4/###.###.###.###:6555
> 2014-20-03 11:15:42.495 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING
> (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 11:15:42.498 WARN Learner
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Truncating log to get in sync
> with the leader 0x2b00000002
> 2014-20-03 11:15:42.499 WARN QuorumPeer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception
> java.lang.NullPointerException
> at
> org.apache.zookeeper.server.persistence.FileTxnLog.truncate(FileTxnLog.java:352)
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.truncateLog(FileTxnSnapLog.java:259)
> at
> org.apache.zookeeper.server.ZKDatabase.truncateLog(ZKDatabase.java:438)
> at
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:339)
> at
> org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:72)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> 2014-20-03 11:15:42.500 INFO Learner
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - shutdown called
> java.lang.Exception: shutdown Observer
> at
> org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:731)
> 2014-20-03 11:15:42.500 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - shutting down
> 2014-20-03 11:15:42.500 INFO QuorumPeer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 11:15:42.501 INFO FastLeaderElection
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id = 3,
> proposed zxid=0x8000000000000000
> 2014-20-03 11:15:42.503 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING
> (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.503 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING
> (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.503 INFO QuorumPeer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 11:15:42.503 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2
> snapdir /xxx/zk_data/version-2
> 2014-20-03 11:15:42.504 INFO Learner
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing
> host4/###.###.###.###:6555
> 2014-20-03 11:15:42.504 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING
> (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 11:15:42.514 INFO FileSnap
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Reading snapshot
> /xxx/zk_data/version-2/snapshot.2c00000000
> 2014-20-03 11:15:42.517 WARN Learner
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Truncating log to get in sync
> with the leader 0x2b00000002
> 2014-20-03 11:15:42.518 WARN QuorumPeer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception
> java.lang.NullPointerException
> at
> org.apache.zookeeper.server.persistence.FileTxnLog.truncate(FileTxnLog.java:352)
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.truncateLog(FileTxnSnapLog.java:259)
> at
> org.apache.zookeeper.server.ZKDatabase.truncateLog(ZKDatabase.java:438)
> at
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:339)
> at
> org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:72)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> {noformat}
> This exception went on and on over and over again (more than 1M times in a
> day) until it then began spewing this exception:
> {noformat}
> 2014-20-03 13:45:32.843 INFO QuorumPeer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 13:45:32.843 INFO FastLeaderElection
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id = 3,
> proposed zxid=0x8000000000000000
> 2014-20-03 13:45:32.844 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING
> (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 13:45:32.845 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING
> (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 13:45:32.845 INFO QuorumPeer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 13:45:32.845 INFO FastLeaderElection [WorkerReceiver[myid=3]] -
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING
> (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 13:45:32.845 INFO ZooKeeperServer
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2
> snapdir /xxx/zk_data/version-2
> 2014-20-03 13:45:32.845 INFO Learner
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing
> host4/###.###.###.###:6555
> 2014-20-03 13:45:32.853 WARN Learner
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception, tries=0,
> connecting to host4/###.###.###.###:6555
> java.net.ConnectException: Cannot assign requested address
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
> at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
> at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> at java.net.Socket.connect(Socket.java:529)
> at
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:224)
> at
> org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:69)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> 2014-20-03 13:45:33.863 INFO FileSnap
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Reading snapshot
> /xxx/zk_data/version-2/snapshot.2c00000000
> {noformat}
> This exception for a while was interspersed with the NPEs but eventually it
> just was spewing the ConnectionException.
> Looking through the code a bit it seems if the FileTxnIterator when
> initialized cannot find any log files the {{inputStream}} is set to null
> which causes truncate() to NPE.. I see in 3.4.6 this has been wrapped in a
> try/finally which closes the iterator.. but i presume that this issue would
> still remain.
> Looking at the system in this state there were 29k+ sockets in CLOSE_WAIT
> state on the system and looking at a heap dump there were tons of Socket
> objects waiting for GC (ie not getting properly closed).. this eventually ran
> the system out of ephemeral ports and hence the ConnectionExceptions..
> It would seem that a quick check of {{itr.next()}} prior to attempting
> truncation would resolve the NPE, but it seems somewhere a connection is not
> being closed properly when an exception occurs.
--
This message was sent by Atlassian JIRA
(v6.2#6252)