Raven888888 opened a new issue, #3236:
URL: https://github.com/apache/bookkeeper/issues/3236

   Hello! I'd like to ask about 2 issues that I've faced:
   
   **Question 1:**
   
   We have 3 servers, each hosted Pulsar Broker, BK and ZK as a cluster. And 
for unknown reason, the Broker runs for a minute and stopped due to timeout 
error. And we notice that there are warnings from broker.log in Server1:
   
   ```
   14:24:52.311 [pulsar-client-io-48-1] WARN  
org.apache.pulsar.client.impl.ConnectionPool - Failed to open connection to 
Server2:6650 : io.netty.channel.AbstractChannel$AnnotatedConnectException: 
finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
   14:24:52.311 [pulsar-client-io-48-1] WARN  
org.apache.pulsar.client.impl.ConnectionHandler - 
[persistent://public/flink_u/__change_events] [reader-1a857264a8] Error 
connecting to broker: org.apache.pulsar.client.api.PulsarClientException: 
java.util.concurrent.CompletionException: 
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) 
failed: Connection refused: Server2/X.X.X.2:6650
   14:24:52.311 [pulsar-client-io-48-1] WARN  
org.apache.pulsar.client.impl.PulsarClientImpl - 
[persistent://public/flink_u/__change_events] Failed to get create topic reader
   java.util.concurrent.CompletionException: 
org.apache.pulsar.client.api.PulsarClientException: 
java.util.concurrent.CompletionException: 
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) 
failed: Connection refused: Server2/X.X.X.2:6650
           at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 ~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 ~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) 
~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701)
 ~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_312]
           at 
org.apache.pulsar.client.impl.ConsumerImpl.connectionFailed(ConsumerImpl.java:994)
 ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
           at 
org.apache.pulsar.client.impl.ConnectionHandler.handleConnectionError(ConnectionHandler.java:82)
 ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
           at 
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
 ~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
 ~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_312]
           at 
org.apache.pulsar.client.impl.ConnectionPool.lambda$null$9(ConnectionPool.java:217)
 ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
           at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
 [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
 [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) 
[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
 [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
   Caused by: org.apache.pulsar.client.api.PulsarClientException: 
java.util.concurrent.CompletionException: 
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) 
failed: Connection refused: Server2/X.X.X.2:6650
           ... 8 more
   Caused by: java.util.concurrent.CompletionException: 
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) 
failed: Connection refused: Server2/X.X.X.2:6650
           at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 ~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 ~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661) 
~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
 ~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_312]
           at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_312]
           at 
org.apache.pulsar.common.util.netty.ChannelFutures.lambda$toCompletableFuture$0(ChannelFutures.java:59)
 ~[org.apache.pulsar-pulsar-common-2.7.0.jar:2.7.0]
           at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
 ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570)
 ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549)
 ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
 ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) 
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) 
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) 
~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:636)
 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:655)
 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           ... 4 more
   Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
   Caused by: java.net.ConnectException: finishConnect(..) failed: Connection 
refused
           at 
io.netty.channel.unix.Errors.throwConnectException(Errors.java:124) 
~[io.netty-netty-transport-native-unix-common-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.unix.Socket.finishConnect(Socket.java:243) 
~[io.netty-netty-transport-native-unix-common-4.1.51.Final.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)
 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)
 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           at 
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) 
~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
           ... 4 more
   ```
   
   We know that this is due to the Broker in Server2 unexpectedly restarted, 
but this also happens on Server1 and Server3 as well. We just couldn't find 
which part is the one we need to investigate.Note that the ZK cluster are 
working fine, and BK shows a lot of these warnings and errors in bookie.log:
   
   ```
   10:57:40.415 [BookieHighPriorityThread-3181-OrderedExecutor-3-0] WARN  
org.apache.bookkeeper.proto.ReadEntryProcessor - Ledger: 112564971  fenced by: 
/X.X.X.1:39188
   ```
   
   = OR =
   
   ```
   10:57:09.890 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR 
org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while 
performing readLac from ledger: 76193704
   org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 76193704 not 
found
           at 
org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52)
 ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at 
org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61)
 ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at 
org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364) 
~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at 
org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71)
 [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at 
org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118)
 [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at 
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) 
[org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_312]
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_312]
           at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
   ```
   
   **Question 2:**
   
   In addition, not sure whether is the right place to ask, but is there any 
way to rebuild or recover the state of bookkeeper cluster in ZK after we ran 
`bin/bookkeeper shell metaformat`?Currently our broker constantly shown this 
error below in broker.log:
   
   ```
   13:39:24.350 [bookkeeper-ml-workers-OrderedExecutor-7-0] ERROR 
org.apache.bookkeeper.client.LedgerCreateOp - Not enough bookies to create 
ledger
   ```
   
   In addition, when attempting to create a topic, this error happens in 
broker.log as well:
   
   ```
   13:39:24.350 [bookkeeper-ml-workers-OrderedExecutor-7-0] WARN  
org.apache.pulsar.broker.service.BrokerService - Failed to create topic 
persistent://public/data
   org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty 
bookies available
   ```
   
   I've ran `bin/bookkeeper shell listfilesondisc -journal`  and 
`bin/bookkeeper shell listfilesondisc -entryLog`, which I'm not sure whether I 
can recover back to previous state based on the available journal and entryLog 
files.Appreciate if anyone can help out. Many thanks! :smile:
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to