devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-840894974
I noticed that in almost every case where the broker is getting a SEND
command, it's producing a SEND_RECEIPT command:
```
$ cat messages_broker10.log | grep 'Received cmd SEND' | wc -l
20806
$ cat messages_broker10.log | grep
'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
20806
$ cat messages_broker8.log | grep 'Received cmd SEND' | wc -l
44654
$ cat messages_broker8.log | grep
'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
44651
```
I assumed I'd see a gap between them, but unless something is creating some
kind of logical deadlock that's causing it to stop emitting logs in between
those commands, it seems like the problem is either before the SEND command or
after the SEND_RECEIPT command.
With that said, I noticed that the `lastConfirmedEntry` isn't progressing on
one of the threads after the freeze occurred:
```
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10636 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10672 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10696 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10747 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10775 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10783 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10815 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10840 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10886 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10915 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10916 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10917 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10918 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10919 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10920 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10944 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10945 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10948 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10952 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10954 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10957 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10959 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10963 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10974 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10977 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10982 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:10984 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11029 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11036 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11039 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11051 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11053 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11065 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11072 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11077 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11086 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11179 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11190 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11199 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11239 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11252 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11258 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11264 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11358 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11359 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11385 -- last: 3215336:11436
DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid
position: 3215336:11437 -- last: 3215336:11436
```
When I say "freeze," that may be misleading because it's not a complete
lockup. We still get messages to produce, but it's so slow that it's almost
equivalent to being frozen. As a comparison, here's one of the functions when
it's healthy:

Here's that same function when it's "frozen":

So, over 1 minute, it's processing 71 messages instead of 160,851 messages.
I also noticed this warning in the logs after it had been stuck for a while,
but I'm not sure if it's related:
```
May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,634
[main-SendThread(server10.example.com:2181)] WARN
org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd038f for sever
server10.example.com/10.20.69.29:2181, Closing socket connection. Attempting
reconnect except it is a SessionExpiredException.
May 12 21:51:16 server10 3ed25e05b45a:
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed
out, have not heard from server in 10302ms for session id 0x1028ed903dd038f
May 12 21:51:16 server10 3ed25e05b45a: #011at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,634
[main-SendThread(server10.example.com:2181)] WARN
org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd0390 for sever
server10.example.com/10.20.69.29:2181, Closing socket connection. Attempting
reconnect except it is a SessionExpiredException.
May 12 21:51:16 server10 3ed25e05b45a:
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed
out, have not heard from server in 12554ms for session id 0x1028ed903dd0390
May 12 21:51:16 server10 3ed25e05b45a: #011at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
May 12 21:51:16 server10 cb3b837bde84: 2021-05-13T03:51:16,635
[NIOWorkerThread-72] WARN org.apache.zookeeper.server.NIOServerCnxn - Unable to
read additional data from client sessionid 0x1028ed903dd038f, likely client has
closed socket
May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,637
[pulsar-io-28-20] INFO org.apache.pulsar.broker.service.ServerCnx - Closed
connection from /10.20.17.40:44725
May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,637
[pulsar-io-28-21] INFO org.apache.pulsar.broker.service.ServerCnx - Closed
connection from /10.20.17.41:58126
May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,639
[pulsar-io-28-19] INFO org.apache.pulsar.broker.service.ServerCnx - Closed
connection from /10.20.17.41:45541
May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,780
[pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder -
[/10.20.69.28:36462] Received cmd PING
May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,780
[pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id:
0x1f7c0cfe, L:/10.20.69.29:6650 - R:/10.20.69.28:36462]] Replying back to ping
message
May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,780
[pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder -
[/10.20.69.28:36462] Received cmd PONG
May 12 21:51:17 server10 3ed25e05b45a: 2021-05-13T03:51:17,036
[pulsar-io-28-19] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id:
0xb055a726, L:/10.20.69.29:6651 - R:/10.20.69.28:36414]] Sending ping message
May 12 21:51:17 server10 3ed25e05b45a: 2021-05-13T03:51:17,551
[main-SendThread(server09.example.com:2181)] WARN
org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service,
session 0x1028ed903dd0390 has expired
May 12 21:51:17 server10 3ed25e05b45a: 2021-05-13T03:51:17,551
[main-EventThread] ERROR org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase -
ZooKeeper client connection to the ZooKeeper server has expired!
May 12 21:51:17 server10 3ed25e05b45a: 2021-05-13T03:51:17,551
[main-SendThread(server09.example.com:2181)] WARN
org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd0390 for sever
server09.example.com/10.20.69.28:2181, Closing socket connection. Attempting
reconnect except it is a SessionExpiredException.
May 12 21:51:17 server10 3ed25e05b45a:
org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect to
ZooKeeper service, session 0x1028ed903dd0390 has expired
May 12 21:51:17 server10 3ed25e05b45a: #011at
org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1419)
~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
May 12 21:51:17 server10 3ed25e05b45a: #011at
org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154)
~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
May 12 21:51:17 server10 3ed25e05b45a: #011at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86)
~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
May 12 21:51:17 server10 3ed25e05b45a: #011at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
May 12 21:51:17 server10 3ed25e05b45a: #011at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275)
[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]