thetumbled opened a new issue, #21916: URL: https://github.com/apache/pulsar/issues/21916
### Search before asking - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar. ### Version 2.9 ### Minimal reproduce step stack: ``` 18:00:50.888 [pulsar-transaction-executor-17-1] ERROR org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer - [persistent://public/default/g-partition-0] Transaction buffer recover fail by read transactionBufferSnapshot timeout! java.util.concurrent.TimeoutException: null at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?] at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?] at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:692) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4] at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?] at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final] at java.lang.Thread.run(Thread.java:834) ~[?:?] 18:00:50.888 [pulsar-transaction-executor-17-1] WARN org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer - Closing topic persistent://public/default/g-partition-0 due to read transaction buffer snapshot while recovering the transaction buffer throw exception org.apache.pulsar.broker.service.BrokerServiceException$ServiceUnitNotReadyException: [persistent://public/default/g-partition-0] Transaction buffer recover fail by read transactionBufferSnapshot timeout! at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:713) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4] at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?] at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final] at java.lang.Thread.run(Thread.java:834) ~[?:?] Caused by: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?] at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?] at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:692) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4] ... 6 more 18:00:51.189 [pulsar-io-32-32] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://public/default/__transaction_buffer_snapshot-partition-2] [multiTopicsReader-ef4f7acfd4] Closed consumer ``` topic snapshot recover fail and can't server for client because of the timeout exception for read. But why the read operation will time out? I find that the method `reader.hasMessageAvailable()` return true, but `reader.readNext` can't return any messages. Further, i notice that the reader try to read messages from compacted topic because there is no content in the normal topic backlog. so i find out the compacted ledger id as follows:  I read out the entry of compacted ledger from bookie and deserialize the data, find out that all messages in these 6 entry is `compactd out`, which means there is no payload in it and will be dropped by client automatically.  So **the reason of the problem is clear:** `reader.hasMessageAvailable()` get the last message id in the compacted topic, which may be a completely `compacted out` message, the code pass through the `reader.hasMessageAvailable()` check, but it will not receive any valid data, so the topic recovery code stuck. But all messages in one entry is compacted out is out ot expect, because we have some filtering logic in `org.apache.pulsar.client.impl.RawBatchConverter#rebatchMessage`: ``` if (messagesRetained > 0) { ... return result; } else { return Optional.empty(); } ``` If all messages in one entry is compacted out, we the variable `messagesRetained` should 0, and we should return an empty message back to write. But it did exist some entries with all compacted messages! So there must be a bug in it. I find out a bug that can result into this situations: - At the first time of compaction, we write a entry with some valid messages and some compacted messages, such as ``` Entry1{ msg1{payload}, msg2{empty}} ``` - At the second time of compaction, we read `Entry1` and find that the `msg1` need to be compacted, so we execute ``` Commands.serializeSingleMessageInBatchWithPayload(emptyMetadata, Unpooled.EMPTY_BUFFER, batchBuffer); ``` and keep `messagesRetained` be 0. Then we meet `msg2{empty}`, it do not have the partition key as well! For example:  So we enter following logic: ``` if (!singleMessageMetadata.hasPartitionKey()) { messagesRetained++; Commands.serializeSingleMessageInBatchWithPayload(singleMessageMetadata, singleMessagePayload, batchBuffer); } ``` we increment the `messagesRetained` but we write a empty payload! As the `messagesRetained` is greater than 0, we pass through the check of `messagesRetained`, we write such a empty entry into the compacted topic. ### What did you expect to see? ~ ### What did you see instead? ~ ### Anything else? _No response_ ### Are you willing to submit a PR? - [X] I'm willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
