devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-840894974


   I noticed that in almost every case where the broker is getting a SEND 
command, it's producing a SEND_RECEIPT command:
   ```
   $ cat messages_broker10.log | grep 'Received cmd SEND' | wc -l
      20806
   $ cat messages_broker10.log | grep 
'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
      20806
   $ cat messages_broker8.log | grep 'Received cmd SEND' | wc -l
      44654
   $ cat messages_broker8.log | grep 
'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
      44651
   ```
   I assumed I'd see a gap between them, but unless something is creating some 
kind of logical deadlock that's causing it to stop emitting logs in between 
those commands, it seems like the problem is either before the SEND command or 
after the SEND_RECEIPT command. 
   
   With that said, I noticed that the `lastConfirmedEntry` isn't progressing on 
one of the threads after the freeze occurred:
   
   ```
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10636 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10672 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10696 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10747 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10775 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10783 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10815 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10840 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10886 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10915 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10916 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10917 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10918 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10919 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10920 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10944 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10945 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10948 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10952 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10954 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10957 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10959 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10963 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10974 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10977 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10982 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:10984 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11029 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11036 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11039 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11051 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11053 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11065 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11072 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11077 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11086 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11179 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11190 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11199 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11239 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11252 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11258 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11264 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11358 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11359 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11385 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid 
position: 3215336:11437 -- last: 3215336:11436
   ```
   
   When I say "freeze," that may be misleading because it's not a complete 
lockup. We still get messages to produce, but it's so slow that it's almost 
equivalent to being frozen. As a comparison, here's one of the functions when 
it's healthy:
   
![image](https://user-images.githubusercontent.com/7418031/118199967-6365ed00-b411-11eb-871c-7b7f8f186404.png)
   
   Here's that same function when it's "frozen":
   
![image](https://user-images.githubusercontent.com/7418031/118200011-78428080-b411-11eb-80ee-c802471b3c4c.png)
   
   So, over 1 minute, it's processing 71 messages instead of 160,851 messages.
   
   I also noticed this warning in the logs after it had been stuck for a while, 
but I'm not sure if it's related:
   
   ```
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,634 
[main-SendThread(fab10.umf.prod.ostk.com:2181)] WARN 
org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd038f for sever 
fab10.umf.prod.ostk.com/10.20.69.29:2181, Closing socket connection. Attempting 
reconnect except it is a SessionExpiredException.
   May 12 21:51:16 fab10 3ed25e05b45a: 
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed 
out, have not heard from server in 10302ms for session id 0x1028ed903dd038f
   May 12 21:51:16 fab10 3ed25e05b45a: #011at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243) 
[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,634 
[main-SendThread(fab10.umf.prod.ostk.com:2181)] WARN 
org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd0390 for sever 
fab10.umf.prod.ostk.com/10.20.69.29:2181, Closing socket connection. Attempting 
reconnect except it is a SessionExpiredException.
   May 12 21:51:16 fab10 3ed25e05b45a: 
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed 
out, have not heard from server in 12554ms for session id 0x1028ed903dd0390
   May 12 21:51:16 fab10 3ed25e05b45a: #011at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243) 
[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:16 fab10 cb3b837bde84: 2021-05-13T03:51:16,635 
[NIOWorkerThread-72] WARN org.apache.zookeeper.server.NIOServerCnxn - Unable to 
read additional data from client sessionid 0x1028ed903dd038f, likely client has 
closed socket
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,637 
[pulsar-io-28-20] INFO org.apache.pulsar.broker.service.ServerCnx - Closed 
connection from /10.20.17.40:44725
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,637 
[pulsar-io-28-21] INFO org.apache.pulsar.broker.service.ServerCnx - Closed 
connection from /10.20.17.41:58126
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,639 
[pulsar-io-28-19] INFO org.apache.pulsar.broker.service.ServerCnx - Closed 
connection from /10.20.17.41:45541
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,780 
[pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - 
[/10.20.69.28:36462] Received cmd PING
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,780 
[pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 
0x1f7c0cfe, L:/10.20.69.29:6650 - R:/10.20.69.28:36462]] Replying back to ping 
message
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,780 
[pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - 
[/10.20.69.28:36462] Received cmd PONG
   May 12 21:51:17 fab10 3ed25e05b45a: 2021-05-13T03:51:17,036 
[pulsar-io-28-19] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 
0xb055a726, L:/10.20.69.29:6651 - R:/10.20.69.28:36414]] Sending ping message
   May 12 21:51:17 fab10 3ed25e05b45a: 2021-05-13T03:51:17,551 
[main-SendThread(fab09.umf.prod.ostk.com:2181)] WARN 
org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, 
session 0x1028ed903dd0390 has expired
   May 12 21:51:17 fab10 3ed25e05b45a: 2021-05-13T03:51:17,551 
[main-EventThread] ERROR org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - 
ZooKeeper client connection to the ZooKeeper server has expired!
   May 12 21:51:17 fab10 3ed25e05b45a: 2021-05-13T03:51:17,551 
[main-SendThread(fab09.umf.prod.ostk.com:2181)] WARN 
org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd0390 for sever 
fab09.umf.prod.ostk.com/10.20.69.28:2181, Closing socket connection. Attempting 
reconnect except it is a SessionExpiredException.
   May 12 21:51:17 fab10 3ed25e05b45a: 
org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect to 
ZooKeeper service, session 0x1028ed903dd0390 has expired
   May 12 21:51:17 fab10 3ed25e05b45a: #011at 
org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1419) 
~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 fab10 3ed25e05b45a: #011at 
org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154)
 ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 fab10 3ed25e05b45a: #011at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86) 
~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 fab10 3ed25e05b45a: #011at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
 ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 fab10 3ed25e05b45a: #011at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) 
[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to