suiyuzeng opened a new issue #2760:
URL: https://github.com/apache/bookkeeper/issues/2760
***Describe the bug***
Bookkeeper begin to shutdown after a zookeeper connection timeout. Session
is not expired and it was attempting reconnect.
In the log, there is a ConnectionLossException when a node was deleted in
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager#releaseUnderreplicatedLedger.
It can not be processed in
org.apache.bookkeeper.replication.ReplicationWorker#deferLedgerLockRelease and
shutdown was called.
`
public void releaseUnderreplicatedLedger(long ledgerId) throws
ReplicationException.UnavailableException {
if (LOG.isDebugEnabled()) {
LOG.debug("releaseLedger(ledgerId={})", ledgerId);
}
try {
Lock l = heldLocks.get(ledgerId);
if (l != null) {
zkc.delete(l.getLockZNode(), -1);
}
} catch (KeeperException.NoNodeException nne) {
// this is ok
} catch (KeeperException ke) {
LOG.error("Error deleting underreplicated ledger lock", ke);
throw new ReplicationException.UnavailableException("Error
contacting zookeeper", ke);
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
throw new ReplicationException.UnavailableException("Interrupted
while connecting zookeeper", ie);
}
heldLocks.remove(ledgerId);
}
`
It shoul be retried as syncCallWithRetries was called. One cause is that the
maxRetries of the retry policy was setted to be 0. The zk client is initialized
in org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase#initialize. And the
retrying policy is created in
org.apache.bookkeeper.meta.zk.ZKMetadataClientDriver#initialize.
`
public synchronized MetadataClientDriver initialize(ClientConfiguration conf,
ScheduledExecutorService scheduler,
StatsLogger
statsLogger,
Optional<Object>
optionalCtx)
throws MetadataException {
super.initialize(
conf,
statsLogger,
new BoundExponentialBackoffRetryPolicy(
conf.getZkTimeout(),
conf.getZkTimeout(),
0),
optionalCtx);
this.statsLogger = statsLogger;
this.clientConf = conf;
this.scheduler = scheduler;
this.bookieAddressTracking = conf.getEnableBookieAddressTracking();
return this;
}
`
The maxRetries is setted 0. Can we set it to be Integer.MAX_VALUE?
***To Reproduce***
Steps to reproduce the behavior:
1. Run bookkeeper
2. release ledger lock when zk timeout
LOG:
06:18:45.732 [main-SendThread(10.168.114.200:2181)] WARN
org.apache.zookeeper.ClientCnxn - Session 0x37b0a37639d001d for sever
xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is
a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read
additional data from server sessionid 0x37b0a37639d001d, likely server has
closed socket
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
06:18:45.757 [main-SendThread(10.168.114.200:2181)] WARN
org.apache.zookeeper.ClientCnxn - Session 0x47b0a3221890013 for sever
xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is
a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read
additional data from server sessionid 0x47b0a3221890013, likely server has
closed socket
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
06:18:45.859 [PendingReplicationTimer] ERROR
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Error deleting
underreplicated ledger lock
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for
/iot-bk/ledgers/underreplication/locks/urL0000826574
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
at
org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
06:18:45.867 [ReplicationWorker] ERROR
org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException
while replicating fragments
org.apache.bookkeeper.replication.ReplicationException$UnavailableException:
Error contacting zookeeper
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:604)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for
/iot-bk/ledgers/underreplication/ledgers/0000/0000/0010/7cc4/urL0001080516
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2256)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2281)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:519)
~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556)
~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556)
~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556)
~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556)
~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:597)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
... 4 more
06:18:45.867 [PendingReplicationTimer] ERROR
org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException
while replicating fragments of ledger 826574
org.apache.bookkeeper.replication.ReplicationException$UnavailableException:
Error contacting zookeeper
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:635)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
at
org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for
/iot-bk/ledgers/underreplication/locks/urL0000826574
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)
~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at
org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
... 3 more
06:18:45.867 [PendingReplicationTimer] INFO
org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication
worker
06:18:45.867 [PendingReplicationTimer] INFO
org.apache.bookkeeper.replication.ReplicationWorker - Shutting down
ReplicationWorker
06:18:45.867 [main-EventThread] INFO
org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is
disconnected from zookeeper now, but it is OK unless we received EXPIRED
event.
06:18:45.874 [ReplicationWorker] INFO
org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited
loop!
06:18:46.516 [AutoRecoveryDeathWatcher-3181] INFO
org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher
noticed the AutoRecovery is not running any more,exiting the watch loop!
06:18:46.516 [AutoRecoveryDeathWatcher-3181] ERROR
org.apache.bookkeeper.common.component.ComponentStarter - Triggered
exceptionHandler of Component: bookie-server because of Exception in Thread:
Thread[AutoRecoveryDeathWatcher-3181,5,main]
java.lang.RuntimeException: AutoRecovery is not running any more
at
org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237)
~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
06:18:46.528 [component-shutdown-thread] INFO
org.apache.bookkeeper.common.component.ComponentStarter - Closing component
bookie-server in shutdown hook.
06:18:46.530 [component-shutdown-thread] INFO
org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto
recovery: 0
06:18:46.530 [component-shutdown-thread] INFO
org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery
06:18:46.532 [component-shutdown-thread] INFO
org.apache.bookkeeper.replication.Auditor - Shutting down auditor
06:18:46.532 [AuditorElector-10.90.126.13:3181] INFO
org.apache.bookkeeper.replication.AuditorElector - Shutting down AuditorElector
06:18:46.532 [component-shutdown-thread] INFO
org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication
worker
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]