lhotari commented on issue #22601: URL: https://github.com/apache/pulsar/issues/22601#issuecomment-2123898015
I couldn't reproduce this with Pulsar Standalone, but I can reproduce it on a local Microk8s cluster, where I could also attach a debugger. With breakpoints on `java.lang.IllegalArgumentException` and `java.nio.BufferUnderflowException`, I can see the problem.

<img width="1329" alt="image" src="https://github.com/apache/pulsar/assets/66864/8524afa2-0c22-4ca9-aaff-30135566d781">

The issue occurs when `.copy()` is called on this line:
https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/ByteBufPair.java#L149

In Netty, `.copy()` isn't thread safe. If it's called from multiple threads at the same time, there is a race condition. This happens here in the Netty code:
https://github.com/netty/netty/blob/243de91df2e9a9bf0ad938f54f76063c14ba6e3d/buffer/src/main/java/io/netty/buffer/ReadOnlyByteBufferBuf.java#L412-L433

`io.netty.buffer.ReadOnlyByteBufferBuf#internalNioBuffer()` returns a shared instance, which gets corrupted. One could argue that this is a bug in `ReadOnlyByteBufferBuf`. At the very least, this is extremely surprising behavior.

`.copy()` was added in https://github.com/apache/pulsar/issues/2401. It looks like the root cause wasn't properly fixed and the problem moved to a different location. In Netty, the `SslHandler` accesses the underlying `ByteBuffer` instances directly. This leads to a multi-threading problem similar to the one caused by `.copy()`.

I think it's now clear where the problem happens, but the fix isn't known yet.
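
To illustrate why a shared `ByteBuffer` instance can produce the `BufferUnderflowException` mentioned in the comment, here is a minimal sketch that uses only `java.nio`. It is not Netty's or Pulsar's code: `SharedBufferSketch`, `SharedViewHolder`, `selectRegion`, `readRegion`, and `interleavedCopy` are hypothetical names, and the two-step "select a region, then read it" copy is an assumed simplification of what a copy on a cached view does. Instead of real threads, it replays one unlucky interleaving on a single thread so the outcome is deterministic.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SharedBufferSketch {

    // Hypothetical stand-in for a buffer that caches one ByteBuffer view
    // and hands that same instance to every caller.
    static final class SharedViewHolder {
        private final ByteBuffer data;
        private final ByteBuffer sharedView;

        SharedViewHolder(byte[] bytes) {
            this.data = ByteBuffer.wrap(bytes).asReadOnlyBuffer();
            this.sharedView = data.duplicate();
        }

        // Same instance every time: position and limit are shared state.
        ByteBuffer sharedView() {
            return sharedView;
        }

        // Fresh view every time: same content, independent position and limit.
        ByteBuffer privateView() {
            return data.duplicate();
        }
    }

    // Step 1 of the assumed copy: select the region on the view.
    static void selectRegion(ByteBuffer view, int index, int length) {
        view.clear().position(index).limit(index + length);
    }

    // Step 2 of the assumed copy: read the selected region.
    static byte[] readRegion(ByteBuffer view, int length) {
        byte[] out = new byte[length];
        view.get(out); // throws BufferUnderflowException if fewer bytes remain
        return out;
    }

    // Replays one interleaving of two copies: caller B runs between
    // caller A's two steps.
    static String interleavedCopy(ByteBuffer viewA, ByteBuffer viewB) {
        selectRegion(viewA, 0, 8); // caller A wants bytes 0..7
        selectRegion(viewB, 6, 2); // caller B wants bytes 6..7
        try {
            return new String(readRegion(viewA, 8), StandardCharsets.US_ASCII);
        } catch (BufferUnderflowException e) {
            return "BufferUnderflowException";
        }
    }

    public static void main(String[] args) {
        SharedViewHolder holder =
                new SharedViewHolder("0123456789".getBytes(StandardCharsets.US_ASCII));
        // Both callers use the one cached view: B's selection clobbers A's.
        System.out.println(interleavedCopy(holder.sharedView(), holder.sharedView()));
        // Each caller uses its own view: A's selection survives.
        System.out.println(interleavedCopy(holder.privateView(), holder.privateView()));
    }
}
```

In this toy, per-caller views avoid the shared position and limit. Whether something similar is a suitable fix for `ByteBufPair` or for the `SslHandler` path is not established here.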