dlg99 opened a new pull request, #3374:
URL: https://github.com/apache/bookkeeper/pull/3374
Descriptions of the changes in this PR:
Shut down Replication Worker and Auditor on non-recoverable ZK error
### Motivation
Some ZK errors can only be resolved by re-creating the ZK client.
Currently BK and its underlying components cannot do that transparently.
When AutoRecovery runs as a separate service, it does not seem to crash in
such cases: it keeps running while unable to do anything useful, until an
operator restarts the service manually.
The logs then contain messages like
```
ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments
org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:610) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:264) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:230) [com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.76.Final.jar:4.1.76.Final]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ledgers/underreplication/ledgers/0000/0001/0ebb
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2589) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$3701(ZooKeeperClient.java:70) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$27.call(ZooKeeperClient.java:1251) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$27.call(ZooKeeperClient.java:1245) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient.getChildren(ZooKeeperClient.java:1245) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager$1.getChildren(ZkLedgerUnderreplicationManager.java:147) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.util.SubTreeCache.getChildren(SubTreeCache.java:118) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:550) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:562) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:562) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:562) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:603) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
```
```
2022-06-24T17:15:55,405 [ZkLedgerManagerScheduler-11-1] ERROR org.apache.bookkeeper.replication.Auditor - Underreplication manager unavailable running periodic check
org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.isLedgerReplicationEnabled(ZkLedgerUnderreplicationManager.java:731) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.replication.Auditor.lambda$checkAllLedgers$7(Auditor.java:1254) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.meta.AbstractZkLedgerManager$5.lambda$operationComplete$0(AbstractZkLedgerManager.java:573) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.76.Final.jar:4.1.76.Final]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ledgers/underreplication/disable
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2021) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2049) ~[com.datastax.oss-pulsar-zookeeper-2.7.2.1.1.33.jar:2.7.2.1.1.33]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
    at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.isLedgerReplicationEnabled(ZkLedgerUnderreplicationManager.java:726) ~[com.datastax.oss-bookkeeper-server-4.14.5.1.0.0.jar:4.14.5.1.0.0]
```
and other similar errors.
### Changes
Replication Worker and Auditor now shut down on such errors, which makes
their error state visible and lets k8s or a service monitor restart them.
Added tests.
Removed KeeperException from some interfaces/implementations to prevent raw
ZK exceptions from sneaking through, but a few others remain; see
https://github.com/apache/bookkeeper/issues/3373
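
For readers unfamiliar with the pattern, the shutdown behavior described above can be sketched roughly as follows. This is not the code from this PR: the class, exception, and callback names are all hypothetical stand-ins, and the sketch assumes the worker can tell a retryable error from a non-recoverable one by exception type.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: every name below is hypothetical, not BookKeeper API.
class ShutdownOnFatalErrorSketch {

    /** Stand-in for an error that only a freshly created client could fix. */
    static class NonRecoverableException extends Exception {
        NonRecoverableException(String message) {
            super(message);
        }
    }

    /** Stand-in for an error that is worth retrying with the same client. */
    static class TransientException extends Exception {
        TransientException(String message) {
            super(message);
        }
    }

    /** One unit of work, e.g. "pick one ledger and re-replicate it". */
    interface Task {
        void runOnce() throws TransientException, NonRecoverableException;
    }

    static class Worker implements Runnable {
        private final Task task;
        private final Runnable onFatalError;
        private final AtomicBoolean running = new AtomicBoolean(true);

        Worker(Task task, Runnable onFatalError) {
            this.task = task;
            this.onFatalError = onFatalError;
        }

        boolean isRunning() {
            return running.get();
        }

        @Override
        public void run() {
            while (running.get()) {
                try {
                    task.runOnce();
                } catch (TransientException e) {
                    // Retry on the next iteration; a real worker would back off here.
                    System.out.println("transient error, retrying: " + e.getMessage());
                } catch (NonRecoverableException e) {
                    // Stop looping and surface the failure instead of spinning uselessly.
                    System.out.println("fatal error, shutting down: " + e.getMessage());
                    running.set(false);
                    onFatalError.run();
                }
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Worker worker = new Worker(() -> {
            // Fail transiently twice, then fail for good.
            if (calls.incrementAndGet() < 3) {
                throw new TransientException("connection loss");
            }
            throw new NonRecoverableException("client must be re-created");
        }, () -> System.out.println("notifying supervisor (a real service might exit non-zero)"));
        worker.run();
        System.out.println("running=" + worker.isRunning() + ", attempts=" + calls.get());
    }
}
```

The point of the callback is that the process, not just the thread, reports the failure, so an external supervisor such as k8s can see it and restart the service.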