thetumbled opened a new issue, #21916:
URL: https://github.com/apache/pulsar/issues/21916

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Version
   
   2.9
   
   ### Minimal reproduce step
   
   stack:
   ```
   18:00:50.888 [pulsar-transaction-executor-17-1] ERROR 
org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer - 
[persistent://public/default/g-partition-0] Transaction buffer recover fail by 
read transactionBufferSnapshot timeout!
   java.util.concurrent.TimeoutException: null
           at 
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) 
~[?:?]
           at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
           at 
org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:692)
 ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
           at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714)
 ~[?:?]
           at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 ~[?:?]
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
           at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 ~[io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final]
           at java.lang.Thread.run(Thread.java:834) ~[?:?]
   18:00:50.888 [pulsar-transaction-executor-17-1] WARN  
org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer - 
Closing topic persistent://public/default/g-partition-0 due to read transaction 
buffer snapshot while recovering the transaction buffer throw exception
   
org.apache.pulsar.broker.service.BrokerServiceException$ServiceUnitNotReadyException:
 [persistent://public/default/g-partition-0] Transaction buffer recover fail by 
read transactionBufferSnapshot timeout!
           at 
org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:713)
 ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
           at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714)
 ~[?:?]
           at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 ~[?:?]
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
           at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 ~[io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final]
           at java.lang.Thread.run(Thread.java:834) ~[?:?]
   Caused by: java.util.concurrent.TimeoutException
           at 
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) 
~[?:?]
           at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
           at 
org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:692)
 ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
           ... 6 more
   
   18:00:51.189 [pulsar-io-32-32] INFO  
org.apache.pulsar.client.impl.ConsumerImpl - 
[persistent://public/default/__transaction_buffer_snapshot-partition-2] 
[multiTopicsReader-ef4f7acfd4] Closed consumer
   ```
   
   topic snapshot recover fail and can't server for client because of the 
timeout exception for read. But why the read operation will time out? I find 
that the method `reader.hasMessageAvailable()` return true, but 
`reader.readNext` can't return any messages.
   
   Further, i notice that the reader try to read messages from compacted topic 
because there is no content in the normal topic backlog. so i find out the 
compacted ledger id as follows:
   
   
![image](https://github.com/apache/pulsar/assets/52550727/aac6de79-c5c7-4dac-a6b4-ec906f046fde)
   
   I read out the entry of compacted ledger from bookie and deserialize the 
data, find out that all messages in these 6 entry is `compactd out`, which 
means there is no payload in it and will be dropped by client automatically.
   
![image](https://github.com/apache/pulsar/assets/52550727/f74adfde-8d31-41ad-ae11-2786a4b624e9)
   
   
   So **the reason of the problem is clear:** `reader.hasMessageAvailable()` 
get the last message id in the compacted topic, which may be a completely 
`compacted out` message, the code pass through the 
`reader.hasMessageAvailable()` check, but it will not receive any valid data, 
so the topic recovery code stuck.
   
   But all messages in one entry is compacted out is out ot expect, because we 
have some filtering logic in 
`org.apache.pulsar.client.impl.RawBatchConverter#rebatchMessage`:
   ```
               if (messagesRetained > 0) {
                   ...
                   return result;
               } else {
                   return Optional.empty();
               }
   ```
   If all messages in one entry is compacted out, we the variable 
`messagesRetained` should 0, and we should return an empty message back to 
write.
   But it did exist some entries with all compacted messages! So there must be 
a bug in it.
   
   I find out a bug that can result into this situations:
   - At the first time of compaction, we write a entry with some valid messages 
and some compacted messages, such as
   ```
   Entry1{ msg1{payload}, msg2{empty}}
   ```
   - At the second time of compaction, we read `Entry1` and find that the 
`msg1` need to be compacted, so we execute
   ```
   Commands.serializeSingleMessageInBatchWithPayload(emptyMetadata,
                                                                         
Unpooled.EMPTY_BUFFER, batchBuffer);
   ```
   and keep `messagesRetained` be 0. Then we meet `msg2{empty}`, it do not have 
the partition key as well! For example:
   
![image](https://github.com/apache/pulsar/assets/52550727/c06116f8-70ee-440c-aaff-571b1be89e59)
   So we enter following logic:
   ```
                   if (!singleMessageMetadata.hasPartitionKey()) {
                       messagesRetained++;
                       
Commands.serializeSingleMessageInBatchWithPayload(singleMessageMetadata,
                                                                         
singleMessagePayload, batchBuffer);
                   }
   ```
   we increment the `messagesRetained` but we write a empty payload! As the 
`messagesRetained` is greater than 0, we pass through the check of 
`messagesRetained`, we write such a empty entry into the compacted topic. 
   
   
   
   
   ### What did you expect to see?
   
   ~
   
   ### What did you see instead?
   
   ~
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to