Raven888888 opened a new issue, #3236:
URL: https://github.com/apache/bookkeeper/issues/3236
Hello! I'd like to ask about 2 issues that I've faced:
**Question 1:**
We have 3 servers, each hosted Pulsar Broker, BK and ZK as a cluster. And
for unknown reason, the Broker runs for a minute and stopped due to timeout
error. And we notice that there are warnings from broker.log in Server1:
```
14:24:52.311 [pulsar-client-io-48-1] WARN
org.apache.pulsar.client.impl.ConnectionPool - Failed to open connection to
Server2:6650 : io.netty.channel.AbstractChannel$AnnotatedConnectException:
finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
14:24:52.311 [pulsar-client-io-48-1] WARN
org.apache.pulsar.client.impl.ConnectionHandler -
[persistent://public/flink_u/__change_events] [reader-1a857264a8] Error
connecting to broker: org.apache.pulsar.client.api.PulsarClientException:
java.util.concurrent.CompletionException:
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..)
failed: Connection refused: Server2/X.X.X.2:6650
14:24:52.311 [pulsar-client-io-48-1] WARN
org.apache.pulsar.client.impl.PulsarClientImpl -
[persistent://public/flink_u/__change_events] Failed to get create topic reader
java.util.concurrent.CompletionException:
org.apache.pulsar.client.api.PulsarClientException:
java.util.concurrent.CompletionException:
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..)
failed: Connection refused: Server2/X.X.X.2:6650
at
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_312]
at
org.apache.pulsar.client.impl.ConsumerImpl.connectionFailed(ConsumerImpl.java:994)
~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
at
org.apache.pulsar.client.impl.ConnectionHandler.handleConnectionError(ConnectionHandler.java:82)
~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
at
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_312]
at
org.apache.pulsar.client.impl.ConnectionPool.lambda$null$9(ConnectionPool.java:217)
~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
at
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: org.apache.pulsar.client.api.PulsarClientException:
java.util.concurrent.CompletionException:
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..)
failed: Connection refused: Server2/X.X.X.2:6650
... 8 more
Caused by: java.util.concurrent.CompletionException:
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..)
failed: Connection refused: Server2/X.X.X.2:6650
at
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_312]
at
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_312]
at
org.apache.pulsar.common.util.netty.ChannelFutures.lambda$toCompletableFuture$0(ChannelFutures.java:59)
~[org.apache.pulsar-pulsar-common-2.7.0.jar:2.7.0]
at
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570)
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549)
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:636)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:655)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException:
finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection
refused
at
io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
~[io.netty-netty-transport-native-unix-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.channel.unix.Socket.finishConnect(Socket.java:243)
~[io.netty-netty-transport-native-unix-common-4.1.51.Final.jar:4.1.51.Final]
at
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
... 4 more
```
We know that this is due to the Broker in Server2 unexpectedly restarted,
but this also happens on Server1 and Server3 as well. We just couldn't find
which part is the one we need to investigate.Note that the ZK cluster are
working fine, and BK shows a lot of these warnings and errors in bookie.log:
```
10:57:40.415 [BookieHighPriorityThread-3181-OrderedExecutor-3-0] WARN
org.apache.bookkeeper.proto.ReadEntryProcessor - Ledger: 112564971 fenced by:
/X.X.X.1:39188
```
= OR =
```
10:57:09.890 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR
org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while
performing readLac from ledger: 76193704
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 76193704 not
found
at
org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52)
~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at
org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61)
~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at
org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364)
~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at
org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71)
[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at
org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118)
[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
[org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_312]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_312]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
```
**Question 2:**
In addition, not sure whether is the right place to ask, but is there any
way to rebuild or recover the state of bookkeeper cluster in ZK after we ran
`bin/bookkeeper shell metaformat`?Currently our broker constantly shown this
error below in broker.log:
```
13:39:24.350 [bookkeeper-ml-workers-OrderedExecutor-7-0] ERROR
org.apache.bookkeeper.client.LedgerCreateOp - Not enough bookies to create
ledger
```
In addition, when attempting to create a topic, this error happens in
broker.log as well:
```
13:39:24.350 [bookkeeper-ml-workers-OrderedExecutor-7-0] WARN
org.apache.pulsar.broker.service.BrokerService - Failed to create topic
persistent://public/data
org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty
bookies available
```
I've ran `bin/bookkeeper shell listfilesondisc -journal` and
`bin/bookkeeper shell listfilesondisc -entryLog`, which I'm not sure whether I
can recover back to previous state based on the available journal and entryLog
files.Appreciate if anyone can help out. Many thanks! :smile:
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]