semistone opened a new issue, #22601:
URL: https://github.com/apache/pulsar/issues/22601

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Read release policy
   
   - [X] I understand that unsupported versions don't get bug fixes. I will 
attempt to reproduce the issue on a supported version of Pulsar client and 
Pulsar broker.
   
   
   ### Version
   
   - 3.2.2
   - 3.1.2
   
   ### Minimal reproduce step
   
   Publish events at about 6k QPS and roughly 100 Mbit/s,
   with message metadata,
   using BatcherBuilder.KEY_BASED mode,
   and send messages from a highly concurrent/parallel producer process (see the sketch below).
   It happens only for near-real-time consumers (almost zero backlog).
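
   For reference, a rough sketch of the producer setup; the service URL, topic, payload size, and key distribution below are placeholders, not our exact values:

```java
import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class ReproProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")          // placeholder URL
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://tenant/ns/topic")           // placeholder topic
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)        // key-based batching, as in our setup
                .blockIfQueueFull(true)
                .create();

        byte[] payload = new byte[2048];                         // placeholder payload (~100 Mbit/s at ~6k QPS)
        for (int i = 0; i < 1_000_000; i++) {
            producer.newMessage()
                    .key("key-" + (i % 100))                     // placeholder key distribution
                    .property("traceId", String.valueOf(i))      // example of the metadata we attach
                    .value(payload)
                    .sendAsync();                                // many threads call sendAsync() concurrently in the real app
        }
        producer.flush();
        producer.close();
        client.close();
    }
}
```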
   
   ### What did you expect to see?
   
   No lost events.
   
   ### What did you see instead?
   
   An error appears in the broker log:
   Failed to peek sticky key from the message metadata
   
   It looks like a thread-safety issue, because it happens randomly; out of 1M events, it only happens a few times.
   
   
   ### Anything else?
   
   The error is similar to https://github.com/apache/pulsar/issues/10967, but I think it's a different issue.
   
   The data in BookKeeper is correct:
   I can download the entry from BookKeeper and parse it successfully,
   or consume the same event a few minutes later and it is consumed successfully.
   However, all subscriptions get the same error on the same event when consuming in near real time (zero backlog).
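
   A minimal sketch of how I re-read the entry later to confirm it parses; the service URL, topic, and start position are placeholders (in practice I seek near the ledger/entry reported in the broker log):

```java
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class ReReadCheck {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")      // placeholder URL
                .build();

        try (Reader<byte[]> reader = client.newReader()
                .topic("persistent://tenant/ns/topic")       // placeholder topic
                .startMessageId(MessageId.earliest)          // placeholder; seek near the failing entry
                .create()) {
            while (reader.hasMessageAvailable()) {
                Message<byte[]> msg = reader.readNext();
                // The same entry that fails in the real-time dispatch path parses fine here.
                System.out.println(msg.getMessageId() + " key=" + msg.getKey());
            }
        }
        client.close();
    }
}
```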
   
   
   I have traced the source code.
   It happens in
   PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer -> Commands.peekAndCopyMessageMetadata
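
   For context, the buffer handed to that peek is expected to start with the checksum header followed by the serialized MessageMetadata. The following is only an illustration of that layout and of the parse pattern I used while debugging, not the actual broker code:

```java
import io.netty.buffer.ByteBuf;
import org.apache.pulsar.common.api.proto.MessageMetadata;

class DebugPeek {
    // Layout expected at the buffer's reader index (illustration only):
    //   [magic 0x0e01 (2 bytes)][CRC32C checksum (4 bytes)][metadata size (4 bytes)][MessageMetadata][payload]
    static MessageMetadata peekMetadataForDebug(ByteBuf metadataAndPayload) {
        int readerIndex = metadataAndPayload.readerIndex();
        short magic = metadataAndPayload.readShort();          // expected 0x0e01 (magicCrc32c)
        int checksum = metadataAndPayload.readInt();           // CRC32C over the remaining bytes
        int metadataSize = metadataAndPayload.readInt();
        MessageMetadata metadata = new MessageMetadata();
        metadata.parseFrom(metadataAndPayload, metadataSize);  // fails or mis-parses when the buffer is corrupted
        metadataAndPayload.readerIndex(readerIndex);           // restore so dispatch still sees the whole entry
        return metadata;
    }
}
```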
   
   I also printed the ByteBuf contents, and I can clearly see that the data is not the same as what is stored in BookKeeper.
   
   For a normal event, the hex dump usually starts with 010e (magicCrc32c):
   ----
   0000000      010e    9529    5fbc    0000    0a03    3a0a    6e69    7267
   ----
   For one of our error events, the ByteBuf has about 48 bytes of strange data, then continues with the normal data:
   ----
   0000000      0000    a610    0000    0000    0200    7239    0000    0000  <== from here
   0000020      0200    1339    0000    0000    ea17    a8b0    8b8e    fa5e
   0000040      2af0    2675    f645    1623    d17e    dc34    526d    ef44  <== until here is garbage
   0000060      010e    9529    5fbc    0000    0a03    3a0a    6e69    7267  <== from here is normal data
   ----
   
   This is just one example; sometimes the first few bytes are correct and the corruption appears after them.
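
   The way I capture these dumps is essentially a guard on the magic bytes before the metadata is parsed. A minimal sketch follows; the method name and the place it is hooked in are mine, only the 0x0e01 magicCrc32c value comes from Pulsar:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;

class CorruptionGuard {
    // Dump the head of the buffer when it does not start with the CRC32C magic (0x0e01).
    static void checkMagicAndDump(ByteBuf metadataAndPayload) {
        int readerIndex = metadataAndPayload.readerIndex();
        short magic = metadataAndPayload.getShort(readerIndex);
        if (magic != (short) 0x0e01) {
            int dumpLen = Math.min(64, metadataAndPayload.readableBytes());
            // Produces dumps like the ones above: garbage bytes before the real entry.
            System.err.println("unexpected leading bytes: "
                    + ByteBufUtil.hexDump(metadataAndPayload, readerIndex, dumpLen));
        }
    }
}
```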
   
   
   I am still trying to debug when and how the ByteBuf returns incorrect data, 
and why it only happens during stress testing. It is still not easy to 
reproduce using the perf tool, but we can 100% reproduce it in our producer 
code.
   
   Does anyone have any idea what could be causing this issue, and any 
suggestions on which library or class may have potential issues? Additionally, 
any suggestions on how to debug this issue or if I need to print any specific 
information to help identify the root cause would be appreciated. Thank you.
   
   
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

