[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944719#comment-13944719
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-1900:
--------------------------------------------

In that case, it seems like FileTxnSnapLog.truncateLog() should throw an 
exception to make Learner.syncWithLeader fail. Otherwise the follower ends up 
missing some transactions.

Why would the log files be missing though? The follower must have told the 
leader that it has lastzxid that's higher than the zxid in the TRUNC message. I 
guess somebody outside of ZooKeeper could delete the log files...

>  NullPointerException in truncate
> ---------------------------------
>
>                 Key: ZOOKEEPER-1900
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1900
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.5, 3.4.6
>         Environment: linux java 1.6
>            Reporter: Steven Bower
>            Priority: Blocker
>             Fix For: 3.4.7, 3.5.0
>
>
> The other day we started up a ZK instance that had been down for a bit (1day) 
> and started getting NPEs all over the place...
> {noformat}
> 2014-20-03 11:15:42.320 INFO  QuorumPeerConfig [main] - Reading configuration 
> from: /xxx/bin/zk/etc/zk.cfg
> 2014-20-03 11:15:42.350 INFO  QuorumPeerConfig [main] - Defaulting to 
> majority quorums
> 2014-20-03 11:15:42.353 INFO  DatadirCleanupManager [main] - 
> autopurge.snapRetainCount set to 3
> 2014-20-03 11:15:42.353 INFO  DatadirCleanupManager [main] - 
> autopurge.purgeInterval set to 0
> 2014-20-03 11:15:42.353 INFO  DatadirCleanupManager [main] - Purge task is 
> not scheduled.
> 2014-20-03 11:15:42.385 INFO  QuorumPeerMain [main] - Starting quorum peer
> 2014-20-03 11:15:42.399 INFO  NIOServerCnxnFactory [main] - binding to port 
> 0.0.0.0/0.0.0.0:5555
> 2014-20-03 11:15:42.413 INFO  QuorumPeer [main] - tickTime set to 2000
> 2014-20-03 11:15:42.413 INFO  QuorumPeer [main] - minSessionTimeout set to -1
> 2014-20-03 11:15:42.413 INFO  QuorumPeer [main] - maxSessionTimeout set to -1
> 2014-20-03 11:15:42.413 INFO  QuorumPeer [main] - initLimit set to 10
> 2014-20-03 11:15:42.456 INFO  FileSnap [main] - Reading snapshot 
> /xxx/zk_data/version-2/snapshot.2c00000000
> 2014-20-03 11:15:42.463 INFO  QuorumCnxManager [Thread-3] - My election bind 
> port: 0.0.0.0/0.0.0.0:7555
> 2014-20-03 11:15:42.470 INFO  QuorumPeer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 11:15:42.471 INFO  FastLeaderElection 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id =  3, 
> proposed zxid=0x8000000000000000
> 2014-20-03 11:15:42.479 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING 
> (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.479 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING 
> (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO  QuorumCnxManager [WorkerSender[myid=3]] - Have 
> smaller server identifier, so dropping the connection: (5, 3)
> 2014-20-03 11:15:42.482 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING 
> (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING 
> (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.482 INFO  QuorumPeer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 11:15:42.486 INFO  Learner 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - TCP NoDelay set to: true
> 2014-20-03 11:15:42.488 INFO  QuorumCnxManager [host1/###.###.###.###:7555] - 
> Received connection request /###.###.###.###:64528
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:host.name=host1
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:java.version=1.6.0_20
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:java.vendor=Sun Microsystems Inc.
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:java.home=/xxx/util/common/jdk1.6.0_20_64bit/jre
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:java.class.path=/xxx/bin/zk/etc:/xxx/bin/zk/lib/slf4j-log4j12-1.7.2.jar:/xxx/bin/zk/lib/jline-0.9.94.jar:/xxx/bin/zk/lib/jul-to-slf4j-1.7.2.jar:/xxx/bin/zk/lib/ZooInspector-3.4.5.jar:/xxx/bin/zk/lib/jcl-over-slf4j-1.7.2.jar:/xxx/bin/zk/lib/log4j-1.2.17.jar:/xxx/bin/zk/lib/zookeeper-3.4.5.jar:/xxx/bin/zk/lib/slf4j-api-1.7.2.jar:/xxx/bin/zk/lib/netty-3.2.2.Final.jar
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:java.library.path=/xxx/util/common/jdk1.6.0_20_64bit/jre/lib/amd64/server:/xxx/util/common/jdk1.6.0_20_64bit/jre/lib/amd64:/xxx/util/common/jdk1.6.0_20_64bit/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:java.io.tmpdir=/tmp
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:java.compiler=<NA>
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:os.name=Linux
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:os.arch=amd64
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:os.version=2.6.32-220.2.1.el6.x86_64
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server environment:user.name=op
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:user.home=/xxx/bin
> 2014-20-03 11:15:42.490 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Server 
> environment:user.dir=/xxx/bin
> 2014-20-03 11:15:42.491 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000 
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2 
> snapdir /xxx/zk_data/version-2
> 2014-20-03 11:15:42.493 INFO  Learner 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing 
> host4/###.###.###.###:6555
> 2014-20-03 11:15:42.495 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING 
> (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 11:15:42.498 WARN  Learner 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Truncating log to get in sync 
> with the leader 0x2b00000002
> 2014-20-03 11:15:42.499 WARN  QuorumPeer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception
> java.lang.NullPointerException
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog.truncate(FileTxnLog.java:352)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.truncateLog(FileTxnSnapLog.java:259)
>         at 
> org.apache.zookeeper.server.ZKDatabase.truncateLog(ZKDatabase.java:438)
>         at 
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:339)
>         at 
> org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:72)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> 2014-20-03 11:15:42.500 INFO  Learner 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - shutdown called
> java.lang.Exception: shutdown Observer
>         at 
> org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:731)
> 2014-20-03 11:15:42.500 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - shutting down
> 2014-20-03 11:15:42.500 INFO  QuorumPeer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 11:15:42.501 INFO  FastLeaderElection 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id =  3, 
> proposed zxid=0x8000000000000000
> 2014-20-03 11:15:42.503 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING 
> (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.503 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING 
> (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 11:15:42.503 INFO  QuorumPeer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 11:15:42.503 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000 
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2 
> snapdir /xxx/zk_data/version-2
> 2014-20-03 11:15:42.504 INFO  Learner 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing 
> host4/###.###.###.###:6555
> 2014-20-03 11:15:42.504 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING 
> (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 11:15:42.514 INFO  FileSnap 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Reading snapshot 
> /xxx/zk_data/version-2/snapshot.2c00000000
> 2014-20-03 11:15:42.517 WARN  Learner 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Truncating log to get in sync 
> with the leader 0x2b00000002
> 2014-20-03 11:15:42.518 WARN  QuorumPeer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception
> java.lang.NullPointerException
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog.truncate(FileTxnLog.java:352)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.truncateLog(FileTxnSnapLog.java:259)
>         at 
> org.apache.zookeeper.server.ZKDatabase.truncateLog(ZKDatabase.java:438)
>         at 
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:339)
>         at 
> org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:72)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> {noformat}
> This exception went on and on over and over again (more than 1M times in a 
> day) until it then began spewing this exception:
> {noformat}
> 2014-20-03 13:45:32.843 INFO  QuorumPeer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - LOOKING
> 2014-20-03 13:45:32.843 INFO  FastLeaderElection 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - New election. My id =  3, 
> proposed zxid=0x8000000000000000
> 2014-20-03 13:45:32.844 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING 
> (n.state), 1 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 13:45:32.845 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), LEADING 
> (n.state), 2 (n.sid), 0x2b (n.peerEPoch), LOOKING (my state)
> 2014-20-03 13:45:32.845 INFO  QuorumPeer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - OBSERVING
> 2014-20-03 13:45:32.845 INFO  FastLeaderElection [WorkerReceiver[myid=3]] - 
> Notification: 2 (n.leader), 0x2b00000002 (n.zxid), 0x2c (n.round), FOLLOWING 
> (n.state), 5 (n.sid), 0x2b (n.peerEPoch), OBSERVING (my state)
> 2014-20-03 13:45:32.845 INFO  ZooKeeperServer 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Created server with tickTime 2000 
> minSessionTimeout 4000 maxSessionTimeout 40000 datadir /xxx/zk_log/version-2 
> snapdir /xxx/zk_data/version-2
> 2014-20-03 13:45:32.845 INFO  Learner 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Observing 
> host4/###.###.###.###:6555
> 2014-20-03 13:45:32.853 WARN  Learner 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Unexpected exception, tries=0, 
> connecting to host4/###.###.###.###:6555
> java.net.ConnectException: Cannot assign requested address
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>         at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>         at java.net.Socket.connect(Socket.java:529)
>         at 
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:224)
>         at 
> org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:69)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> 2014-20-03 13:45:33.863 INFO  FileSnap 
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:5555] - Reading snapshot 
> /xxx/zk_data/version-2/snapshot.2c00000000
> {noformat}
> This exception for a while was interspersed with the NPEs but eventually it 
> just was spewing the ConnectionException.
> Looking through the code a bit it seems if the FileTxnIterator when 
> initialized cannot find any log files the {{inputStream}} is set to null 
> which causes truncate() to NPE.. I see in 3.4.6 this has been wrapped in a 
> try/finally which closes the iterator.. but i presume that this issue would 
> still remain.
> Looking at the system in this state there were 29k+ sockets in CLOSE_WAIT 
> state on the system and looking at a heap dump there were tons of Socket 
> objects waiting for GC (ie not getting properly closed).. this eventually ran 
> the system out of ephemeral ports and hence the ConnectionExceptions..
> It would seem that a quick check of {{itr.next()}} prior to attempting 
> truncation would resolve the NPE, but it seems somewhere a connection is not 
> being closed properly when an exception occurs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to