suiyuzeng opened a new issue #2760:
URL: https://github.com/apache/bookkeeper/issues/2760


   ***Describe the bug***
   Bookkeeper begin to shutdown after a zookeeper connection timeout. Session 
is not expired and it was attempting reconnect.
   In the log, there is a ConnectionLossException when a node was deleted in 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager#releaseUnderreplicatedLedger.
 It can not be processed in 
org.apache.bookkeeper.replication.ReplicationWorker#deferLedgerLockRelease and 
shutdown was called.
   `
   public void releaseUnderreplicatedLedger(long ledgerId) throws 
ReplicationException.UnavailableException {
           if (LOG.isDebugEnabled()) {
               LOG.debug("releaseLedger(ledgerId={})", ledgerId);
           }
           try {
               Lock l = heldLocks.get(ledgerId);
               if (l != null) {
                   zkc.delete(l.getLockZNode(), -1);
               }
           } catch (KeeperException.NoNodeException nne) {
               // this is ok
           } catch (KeeperException ke) {
               LOG.error("Error deleting underreplicated ledger lock", ke);
               throw new ReplicationException.UnavailableException("Error 
contacting zookeeper", ke);
           } catch (InterruptedException ie) {
               Thread.currentThread().interrupt();
               throw new ReplicationException.UnavailableException("Interrupted 
while connecting zookeeper", ie);
           }
           heldLocks.remove(ledgerId);
       }
   `
   It shoul be retried as syncCallWithRetries was called. One cause is that the 
maxRetries of the retry policy was setted to be 0. The zk client is initialized 
in org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase#initialize. And the 
retrying policy is created in 
org.apache.bookkeeper.meta.zk.ZKMetadataClientDriver#initialize.
   `
   public synchronized MetadataClientDriver initialize(ClientConfiguration conf,
                                                           
ScheduledExecutorService scheduler,
                                                           StatsLogger 
statsLogger,
   
                                                           Optional<Object> 
optionalCtx)
               throws MetadataException {
           super.initialize(
               conf,
               statsLogger,
               new BoundExponentialBackoffRetryPolicy(
                   conf.getZkTimeout(),
                   conf.getZkTimeout(),
                   0),
               optionalCtx);
           this.statsLogger = statsLogger;
           this.clientConf = conf;
           this.scheduler = scheduler;
           this.bookieAddressTracking = conf.getEnableBookieAddressTracking();
           return this;
       }
   `
   
   The maxRetries is setted 0. Can we set it to be Integer.MAX_VALUE? 
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   1. Run bookkeeper
   2. release ledger lock when zk timeout
   
   
   LOG:
   06:18:45.732 [main-SendThread(10.168.114.200:2181)] WARN  
org.apache.zookeeper.ClientCnxn - Session 0x37b0a37639d001d for sever 
xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is 
a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read 
additional data from server sessionid 0x37b0a37639d001d, likely server has 
closed socket
       at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
 ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
   
   
   06:18:45.757 [main-SendThread(10.168.114.200:2181)] WARN  
org.apache.zookeeper.ClientCnxn - Session 0x47b0a3221890013 for sever 
xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is 
a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read 
additional data from server sessionid 0x47b0a3221890013, likely server has 
closed socket
       at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
 ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
   06:18:45.859 [PendingReplicationTimer] ERROR 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Error deleting 
underreplicated ledger lock
   org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/iot-bk/ledgers/underreplication/locks/urL0000826574
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.   jar:4.14.1]
       at 
org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
       at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
   06:18:45.867 [ReplicationWorker] ERROR 
org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException 
while replicating fragments
   org.apache.bookkeeper.replication.ReplicationException$UnavailableException: 
Error contacting zookeeper
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:604)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 ~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
   Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/iot-bk/ledgers/underreplication/ledgers/0000/0000/0010/7cc4/urL0001080516
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2256) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2281) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:519)
 ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556)
 ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556)
 ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556)
 ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556)
 ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:597)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       ... 4 more
   06:18:45.867 [PendingReplicationTimer] ERROR 
org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException 
while replicating fragments of ledger 826574
   org.apache.bookkeeper.replication.ReplicationException$UnavailableException: 
Error contacting zookeeper
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:635)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.   jar:4.14.1]
       at 
org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
       at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
   Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/iot-bk/ledgers/underreplication/locks/urL0000826574
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) 
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at 
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.   jar:4.14.1]
       ... 3 more
   06:18:45.867 [PendingReplicationTimer] INFO  
org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication 
worker
   06:18:45.867 [PendingReplicationTimer] INFO  
org.apache.bookkeeper.replication.ReplicationWorker - Shutting down 
ReplicationWorker
   06:18:45.867 [main-EventThread] INFO  
org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is 
disconnected from zookeeper now, but it is OK unless we received EXPIRED    
event.
   06:18:45.874 [ReplicationWorker] INFO  
org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited 
loop!
   06:18:46.516 [AutoRecoveryDeathWatcher-3181] INFO  
org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher 
noticed the AutoRecovery is not running any more,exiting    the watch loop!
   06:18:46.516 [AutoRecoveryDeathWatcher-3181] ERROR 
org.apache.bookkeeper.common.component.ComponentStarter - Triggered 
exceptionHandler of Component: bookie-server because of Exception in  Thread: 
Thread[AutoRecoveryDeathWatcher-3181,5,main]
   java.lang.RuntimeException: AutoRecovery is not running any more
       at 
org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
   06:18:46.528 [component-shutdown-thread] INFO  
org.apache.bookkeeper.common.component.ComponentStarter - Closing component 
bookie-server in shutdown hook.
   06:18:46.530 [component-shutdown-thread] INFO  
org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto 
recovery: 0
   06:18:46.530 [component-shutdown-thread] INFO  
org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery
   06:18:46.532 [component-shutdown-thread] INFO  
org.apache.bookkeeper.replication.Auditor - Shutting down auditor
   06:18:46.532 [AuditorElector-10.90.126.13:3181] INFO  
org.apache.bookkeeper.replication.AuditorElector - Shutting down AuditorElector
   06:18:46.532 [component-shutdown-thread] INFO  
org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication 
worker


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to