zymap opened a new issue, #17516:
URL: https://github.com/apache/pulsar/issues/17516

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.
   
   
   ### Version
   
   2.8.1
   
   ### Minimal reproduce step
   
   Not reproduced yet.
   
   ### What did you expect to see?
   
   The offloaded data shouldn't be lost.
   
   ### What did you see instead?
   
   **The error message we saw in the broker is caused by a consumer trying to read a non-existent offloaded ledger.**
   
   ```
   2022-08-16T09:22:04.791520867Z 09:22:04.789 [offloader-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreManagedLedgerOffloader - Failed readOffloaded:
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791545871Z java.lang.NullPointerException: null
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791548867Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.DataBlockUtils.lambda$static$0(DataBlockUtils.java:73) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791551588Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl.open(BlobStoreBackedReadHandleImpl.java:237) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791554123Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreManagedLedgerOffloader.lambda$readOffloaded$3(BlobStoreManagedLedgerOffloader.java:506) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791556370Z        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791565338Z        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) [com.google.guava-guava-30.1-jre.jar:?]
   ```
   
   ## Details
   
   We saw that some ledgers are marked as offloaded but cannot be found in S3.
   
   For example, we looked at the logs for ledger 1025044. They show that the offload failed, and the offloader then cleaned up the previous offload information and tried to offload the ledger again. However, we never saw a log line saying the offload succeeded.
   
   **The weird thing is that the logs show the ledger failed to offload, but the metadata shows it was offloaded successfully.**
   
   ![Screen Shot 2022-09-07 at 17 56 04](https://user-images.githubusercontent.com/24502569/188850625-db7a4707-f826-4376-8a1e-50a31c65d84b.png)
   
   In the Pulsar offloader implementation, the offload process has three steps.
   
   First, prepare the metadata. The offloader generates a UUID for the ledger to be offloaded and uses it to derive the object names used in cloud storage. It then persists the offload context (including the UUID and offloader information, such as the offloader name, bucket, and so on) into the metadata store (ZooKeeper).
   
   Second, offload the ledger. The offloader starts copying the ledger into cloud storage. The index file name is `<UUID>-ledger-<ledgerId>-index`, and the data file name is `<UUID>-ledger-<ledgerId>-data` (see the small naming sketch below).
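
   For clarity, a minimal sketch of that naming scheme, simply following the description above (the exact suffixes and helper names are assumptions; the real `DataBlockUtils` key helpers in Pulsar may differ):

   ```java
   import java.util.UUID;

   // Illustration only: object names derived from the offload UUID and the ledger id,
   // following the naming described in this issue (not verified against DataBlockUtils).
   public final class OffloadKeyNames {
       public static String indexKey(UUID uuid, long ledgerId) {
           return String.format("%s-ledger-%d-index", uuid, ledgerId);
       }

       public static String dataKey(UUID uuid, long ledgerId) {
           return String.format("%s-ledger-%d-data", uuid, ledgerId);
       }
   }
   ```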
   
   Third, once the ledger is offloaded successfully, the offloader sets `complete` in the `LedgerInfo.OffloadContext` to `true` and then persists it into the metadata store. When the persist succeeds, it updates the in-memory ledgers map to the latest status. From then on, if a consumer wants to read the ledger, it will read from tiered storage.
   
   If any step fails, the offloader cleans up the offloaded files and runs the process again (a simplified sketch of the whole flow follows).
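
   A minimal, hypothetical sketch of that three-step flow, assuming simplified helper names (this is not the real `BlobStoreManagedLedgerOffloader` / managed-ledger code, just an illustration of the steps described above):

   ```java
   import java.util.UUID;

   // Simplified sketch of the three-step offload flow described above.
   // All class and method names are illustrative assumptions, not the real Pulsar API.
   public abstract class OffloadFlowSketch {

       public void offloadLedger(long ledgerId) throws Exception {
           // Step 1: prepare the metadata. Generate a UUID and persist the offload
           // context (UUID, driver name, bucket, ...) into the metadata store (ZooKeeper).
           UUID uuid = UUID.randomUUID();
           persistOffloadContext(ledgerId, uuid);

           try {
               // Step 2: copy the ledger data and index into cloud storage, using
               // object names derived from the UUID and the ledger id.
               writeLedgerToCloudStorage(ledgerId, uuid);

               // Step 3: set complete=true in LedgerInfo.OffloadContext, persist it to the
               // metadata store, then update the in-memory ledgers map so that consumers
               // read this ledger from tiered storage from now on.
               persistOffloadComplete(ledgerId, uuid);
               updateInMemoryLedgersMap(ledgerId);
           } catch (Exception e) {
               // Any failure: clean up the uploaded objects and let the offload be retried.
               // This cleanup becomes dangerous if the metadata write actually succeeded
               // on the server while the client only saw an error (see below).
               deleteOffloadedObjects(ledgerId, uuid);
               throw e;
           }
       }

       // Hypothetical helpers standing in for the real metadata-store and blob-store calls.
       protected abstract void persistOffloadContext(long ledgerId, UUID uuid) throws Exception;
       protected abstract void writeLedgerToCloudStorage(long ledgerId, UUID uuid) throws Exception;
       protected abstract void persistOffloadComplete(long ledgerId, UUID uuid) throws Exception;
       protected abstract void updateInMemoryLedgersMap(long ledgerId);
       protected abstract void deleteOffloadedObjects(long ledgerId, UUID uuid) throws Exception;
   }
   ```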
   
   ---
   
   From the logs, we found that ZooKeeper had not been healthy since 8:00. We then found that when the offload of this ledger ran the third step, it failed and threw an exception:
   
   `Caused by: org.apache.pulsar.metadata.api.MetadataStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /managed-ledgers/navision/product/persistent/ordered.product_category.navision_model.events-partition-0`

   After that, the Pulsar offloader could not send requests to ZooKeeper successfully. This caused the offload process to run again and again, and what we see in the logs is that the ledger keeps being offloaded but never succeeds.
   
   There is a case in which the client throws `ConnectionLoss` even though the request was sent to the server successfully: the connection was closed for some other reason and the client could not receive the response.
   
   This would explain what we see in the metadata: the offload `complete` flag is `true`, but we cannot find the data in S3.
   The metadata update succeeds on the server but an error is returned to the client, so our offloader cleans up the offloaded ledger. After that, the consumer cannot get the messages. The suspected sequence is sketched below.
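
   To make that suspected sequence concrete, here is a hypothetical sketch reusing the helper names from the flow sketch above (again an assumption for illustration, not the actual Pulsar code; the method would live in the same sketch class):

   ```java
   // Suspected failure sequence, illustration only (reuses the hypothetical helpers above).
   public void suspectedFailureSequence(long ledgerId, UUID uuid) throws Exception {
       try {
           // The znode update setting complete=true is applied by the ZooKeeper server...
           persistOffloadComplete(ledgerId, uuid);
       } catch (Exception connectionLoss) {
           // ...but the connection drops before the response arrives, so the client only
           // sees ConnectionLoss and treats the offload as failed.
           // The cleanup then deletes the freshly uploaded objects from S3, even though the
           // persisted metadata now says complete=true and still points at those objects.
           deleteOffloadedObjects(ledgerId, uuid);
           // Result: complete=true in LedgerInfo.OffloadContext, but no objects in S3.
           // A later read is routed to tiered storage and fails, matching the
           // NullPointerException in readOffloaded at the top of this issue.
       }
   }
   ```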
   
   
   
   
   
   
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

