devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-800868920


   I created a test cluster (on fast hardware) specifically for reproducing 
this issue. In our very simple function flow, using simple Java functions 
without external dependencies, on Pulsar **2.6.3**, the bug appeared (as 
expected) within seconds of starting to flow data (around 4k msg/sec at 
140KB/msg average), blocking the flow and causing a backlog to accumulate.
   
   I looked up the broker the frozen topic was running on and got heap dumps 
and thread dumps of the broker and functions running on that broker. 
   There was nothing abnormal in the thread dumps that I could find. However, 
the topic stats and internal stats seem to have some clues. 
   
   I've attached the topic stats and internal topic stats of the topic upstream 
from the frozen topic, the frozen topic, and the topic immediately downstream 
from the frozen topic. 
   The flow looks like this:
   
   -> `first-function` -> `first-topic` -> `second-function` -> `second-topic` 
-> `third-function` -> `third-topic` -> `fourth-function` -> 
   
   The `second-topic` is the one that froze and started accumulating backlog. 
   
   The `second-topic` reports -45 available permits for its consumer, 
`third-function`. 
   All three topics report 0 pendingReadOps. 
   `first-topic` and `third-topic` have waitingReadOp = true, indicating the 
subscriptions are waiting for messages. 
   `second-topic` has waitingReadOp = false, indicating its subscription hasn't 
caught up or isn't waiting for messages. 
   `second-topic` reports waitingCursorsCount = 0, so it has no cursors waiting 
for messages. 
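
   The checks above could be scripted roughly as follows. This is only a 
sketch: it assumes the stats JSON has a `subscriptions` map with a `consumers` 
list, and that the internal stats JSON has a `cursors` map, as in the attached 
files. The helper names, the consumer name `c1`, and the subscription key are 
hypothetical; the sample values mirror what I reported for `second-topic`.

```python
def find_stalled_consumers(topic_stats):
    """Return (subscription, consumer, permits) for consumers with permits <= 0."""
    stalled = []
    for sub_name, sub in topic_stats.get("subscriptions", {}).items():
        for consumer in sub.get("consumers", []):
            permits = consumer.get("availablePermits", 0)
            if permits <= 0:
                name = consumer.get("consumerName", "?")
                stalled.append((sub_name, name, permits))
    return stalled


def find_idle_cursors(internal_stats):
    """Return cursors that are neither waiting for new entries nor reading."""
    return [
        name
        for name, cursor in internal_stats.get("cursors", {}).items()
        if not cursor.get("waitingReadOp", False)
        and cursor.get("pendingReadOps", 0) == 0
    ]


# Hypothetical sample data shaped like the stats of `second-topic`.
sample_stats = {
    "subscriptions": {
        "third-function": {
            "consumers": [{"consumerName": "c1", "availablePermits": -45}]
        }
    }
}
sample_internal = {
    "waitingCursorsCount": 0,
    "cursors": {
        "third-function": {"waitingReadOp": False, "pendingReadOps": 0}
    },
}

print(find_stalled_consumers(sample_stats))
print(find_idle_cursors(sample_internal))
```

   A cursor that shows up in both lists (no permits on the consumer side, no 
read in flight on the broker side) matches the frozen state described above.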
   
   `third-topic` has pendingAddEntriesCount = 81, indicating it's waiting for 
write requests to complete. 
   `first-topic` and `second-topic` have pendingAddEntriesCount = 0
   
   `third-topic` is in the state ClosingLedger. 
   `first-topic` and `second-topic` are in the state LedgerOpened
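
   For the write side, a similar sketch could flag any topic whose internal 
stats show pending add requests or a state other than LedgerOpened. The 
function name is hypothetical, and the dictionaries below only contain the two 
fields I quoted above, not the full internal stats.

```python
def write_side_warnings(topic_name, internal_stats):
    """Return human-readable warnings about a topic's write-side state."""
    warnings = []
    pending = internal_stats.get("pendingAddEntriesCount", 0)
    state = internal_stats.get("state", "")
    if pending > 0:
        warnings.append(f"{topic_name}: {pending} add-entry requests pending")
    if state != "LedgerOpened":
        warnings.append(f"{topic_name}: managed ledger state is {state}")
    return warnings


# Values as reported in this comment; everything else is omitted.
reported = {
    "first-topic": {"pendingAddEntriesCount": 0, "state": "LedgerOpened"},
    "second-topic": {"pendingAddEntriesCount": 0, "state": "LedgerOpened"},
    "third-topic": {"pendingAddEntriesCount": 81, "state": "ClosingLedger"},
}

for topic_name, stats in reported.items():
    for warning in write_side_warnings(topic_name, stats):
        print(warning)
```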
   
   `second-topic`'s cursor has `markDeletePosition` = 17525:0 and `readPosition` 
= 17594:9
   `third-topic`'s cursor has `markDeletePosition` = 17551:9 and 
`readPosition` = 17551:10
   
   So, the `third-topic`'s cursor's `readPosition` is adjacent to its 
markDeletePosition. 
   However, `second-topic`'s cursor's `readPosition` is farther ahead than 
`third-topic`'s `readPosition`. 
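
   To make the position comparison concrete, here is a small sketch that 
parses the `ledgerId:entryId` strings and compares them within a single 
cursor. The function names are hypothetical. It only reports how many ledger 
IDs apart the two positions are, since the number of entries between them 
can't be derived without the per-ledger entry counts.

```python
def parse_position(text):
    """Parse a 'ledgerId:entryId' string into a tuple of two ints."""
    ledger_id, entry_id = text.split(":")
    return int(ledger_id), int(entry_id)


def cursor_gap(mark_delete_position, read_position):
    """Compare a cursor's markDeletePosition with its readPosition."""
    md_ledger, md_entry = parse_position(mark_delete_position)
    rd_ledger, rd_entry = parse_position(read_position)
    return {
        "ledgers_apart": rd_ledger - md_ledger,
        # Adjacent means: same ledger, read position one entry ahead.
        "adjacent": rd_ledger == md_ledger and rd_entry - md_entry == 1,
    }


# Positions as reported in this comment.
second_gap = cursor_gap("17525:0", "17594:9")
third_gap = cursor_gap("17551:9", "17551:10")

print("second-topic:", second_gap)
print("third-topic:", third_gap)
```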
   
   Is it unusual for a downstream topic's cursor to have a `readPosition` 
farther ahead (larger number) than the `readPosition` of the topic immediately 
upstream from it when the downstream topic's only source of messages is that 
upstream topic and not more than a few hundred thousand messages have been sent 
through the pipe?
   
   GitHub won't let me attach .json files, so I renamed them with a .txt 
extension. In the attached zip, they have the .json extension for convenience 
when viewing.
   
[first-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154489/first-topic-internal-stats.json.txt)
   
[first-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154490/first-topic-stats.json.txt)
   
[second-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154491/second-topic-internal-stats.json.txt)
   
[second-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154492/second-topic-stats.json.txt)
   
[third-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154493/third-topic-internal-stats.json.txt)
   
[third-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154494/third-topic-stats.json.txt)
   
   [stats.zip](https://github.com/apache/pulsar/files/6154471/stats.zip)
   
   I've also attached the broker thread dump: 
[thread_dump_3-16.txt](https://github.com/apache/pulsar/files/6154506/thread_dump_3-16.txt)
   
   Regarding the heap dump, I can't attach that until I'm able to reproduce 
this bug with synthetic data, but in the meantime, if anyone wants me to look 
up specific things in the heap dump, I'll be happy to do that. 
   I can also inspect the heap dump of the `third-function` in case that might 
provide additional info.  



