semistone opened a new issue, #22601: URL: https://github.com/apache/pulsar/issues/22601
### Search before asking

- [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Read release policy

- [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

### Version

- 3.2.2
- 3.1.2

### Minimal reproduce step

Publish events at about 6k QPS and 100 Mbit/s, with message metadata, using a producer in `BatcherBuilder.KEY_BASED` batching mode, sending from a highly concurrent/parallel producer process (a minimal sketch of this producer setup is attached at the end of this issue). The problem only occurs for near-real-time consumers (almost zero backlog).

### What did you expect to see?

No lost events.

### What did you see instead?

An error in the broker log saying "Failed to peek sticky key from the message metadata". It looks like a thread-safety issue, because it happens randomly: out of 1M events it only happens a few times.

### Anything else?

The error is similar to https://github.com/apache/pulsar/issues/10967, but I think it is a different issue.

The data in BookKeeper is correct: I can download the event from BookKeeper and parse it successfully, or consume the same event a few minutes later and it is consumed successfully. However, all subscriptions get the same error on the same event when consuming in real time (zero backlog).

I have traced the source code. It happens in PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer -> Commands.peekAndCopyMessageMetadata.

I also printed the ByteBuf contents, and I can clearly see the data is not the same as what is stored in BookKeeper. For a normal event, the hex dump usually starts with 010e (magicCrc32c):

----
0000000 010e 9529 5fbc 0000 0a03 3a0a 6e69 7267
----

In one of our bad events, the ByteBuf starts with about 48 bytes of garbage and then continues with normal data:

----
0000000 0000 a610 0000 0000 0200 7239 0000 0000 <== from here
0000020 0200 1339 0000 0000 ea17 a8b0 8b8e fa5e
0000040 2af0 2675 f645 1623 d17e dc34 526d ef44 <== until here is garbage
0000060 010e 9529 5fbc 0000 0a03 3a0a 6e69 7267 <== from here is normal data
----

This is just one example; sometimes the first few bytes are correct and the corruption appears after them.

I am still trying to debug when and how the ByteBuf ends up with incorrect data, and why it only happens during stress testing. It is still not easy to reproduce using the perf tool, but we can reproduce it 100% with our producer code.

Does anyone have any idea what could be causing this issue, or suggestions on which library or class may have a potential issue? Any suggestions on how to debug this, or which specific information I should print to help identify the root cause, would also be appreciated (a sketch of the instrumentation I plan to add is attached at the end of this issue). Thank you.

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
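For reference, here is a minimal sketch of the producer setup described in the reproduce step. This is not our actual producer code: the service URL, topic, keys and payloads are placeholders, and in the real test many such senders run in parallel.

```java
import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class KeyBasedProducerSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL; the real test runs against our own cluster.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Key-based batching, as used in our producer.
        Producer<byte[]> producer = client.newProducer(Schema.BYTES)
                .topic("persistent://public/default/stress-test") // placeholder topic
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)
                .create();

        // In the real test this send path is exercised by many threads/processes
        // concurrently, at roughly 6k messages/sec and 100 Mbit/s in total.
        for (int i = 0; i < 1000; i++) {
            producer.newMessage()
                    .key("key-" + (i % 100))              // placeholder keys
                    .value(("payload-" + i).getBytes())   // placeholder payload
                    .sendAsync();
        }

        producer.flush();
        producer.close();
        client.close();
    }
}
```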
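And here is a rough sketch of the instrumentation I plan to add around the failing dispatch path. The class and method names are made up (this is not existing Pulsar code); the idea is just to log the ByteBuf's refCnt, readerIndex and a hex dump at the moment the leading bytes do not match the 0x0e01 checksum magic, to help tell corrupted data apart from a released or reused buffer.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MetadataPeekDebug {
    private static final Logger log = LoggerFactory.getLogger(MetadataPeekDebug.class);

    // Checksum magic that a healthy entry starts with in our setup; it shows up
    // as the "010e" word in the od dumps above (od prints little-endian 16-bit
    // words). Entries published without a checksum would not start with it, so
    // this check is only meant for our environment, where they do.
    private static final int CHECKSUM_MAGIC = 0x0e01;

    // Intended to be called just before Commands.peekAndCopyMessageMetadata,
    // e.g. from AbstractBaseDispatcher.filterEntriesForConsumer.
    public static void checkEntryHeader(ByteBuf headersAndPayload, long ledgerId, long entryId) {
        int readerIndex = headersAndPayload.readerIndex();
        if (headersAndPayload.readableBytes() < 2) {
            log.warn("[{}:{}] entry too short: {} readable bytes",
                    ledgerId, entryId, headersAndPayload.readableBytes());
            return;
        }
        int magic = headersAndPayload.getUnsignedShort(readerIndex);
        if (magic != CHECKSUM_MAGIC) {
            // A refCnt of 0 or an unexpected readerIndex would point to a buffer
            // lifecycle problem (double release / cache reuse) rather than bad
            // data read from BookKeeper.
            log.warn("[{}:{}] unexpected leading magic 0x{}: refCnt={}, readerIndex={}, readableBytes={}\n{}",
                    ledgerId, entryId, Integer.toHexString(magic),
                    headersAndPayload.refCnt(), readerIndex, headersAndPayload.readableBytes(),
                    ByteBufUtil.prettyHexDump(headersAndPayload, readerIndex,
                            Math.min(headersAndPayload.readableBytes(), 128)));
        }
    }
}
```

If the dump at this point already contains the garbage bytes while refCnt and readerIndex look normal, that would suggest the corruption happens earlier in the read/cache path rather than in the dispatcher itself.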