zymap opened a new issue, #17516:
URL: https://github.com/apache/pulsar/issues/17516

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.
   
   
   ### Version
   
   2.8.1
   
   ### Minimal reproduce step
   
   Not reproduced yet.
   
   ### What did you expect to see?
   
   The offloaded data shouldn't be lost.
   
   ### What did you see instead?
   
   **The error message we saw in the broker is caused by a consumer trying to read a non-existent offloaded ledger.**
   
   ```
   2022-08-16T09:22:04.791520867Z 09:22:04.789 [offloader-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreManagedLedgerOffloader - Failed readOffloaded:
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791545871Z java.lang.NullPointerException: null
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791548867Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.DataBlockUtils.lambda$static$0(DataBlockUtils.java:73) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791551588Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl.open(BlobStoreBackedReadHandleImpl.java:237) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791554123Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreManagedLedgerOffloader.lambda$readOffloaded$3(BlobStoreManagedLedgerOffloader.java:506) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791556370Z        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791565338Z        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) [com.google.guava-guava-30.1-jre.jar:?]
   ```
   
   ## Details
   
   We saw that some ledgers are marked as offloaded but cannot be found in S3.
   
   For example, we looked at the logs for ledger 1025044. They show that the offload failed, and the offloader then cleaned up the previous offload information and tried to offload the ledger again. However, we never saw a log line saying the offload succeeded.
   
   **The weird thing is that the logs show the ledger failed to offload, but the metadata shows it was offloaded successfully.**
   
   ![Screen Shot 2022-09-07 at 17 56 04](https://user-images.githubusercontent.com/24502569/188850625-db7a4707-f826-4376-8a1e-50a31c65d84b.png)
   
   In the Pulsar offloader implementation, the offload process has three steps.
   
   First, prepare the metadata. The offloader generates a UUID for the ledger to be offloaded and uses it to derive the object names used in cloud storage. It then persists the offload context (including the UUID and offloader information, such as the offloader name, bucket, and so on) into the metadata store (ZooKeeper).
   
   Second, offload the ledger. The offloader starts copying the ledger into cloud storage. The index file name is `<UUID>-ledger-<ledgerId>-index`, and the data file name is `<UUID>-ledger-<ledgerId>-data` (see the small naming sketch below).
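
   For clarity, a minimal sketch of that naming scheme, simply following the description above (the exact suffixes and helper names are assumptions; the real `DataBlockUtils` key helpers in Pulsar may differ):

   ```java
   import java.util.UUID;

   // Illustration only: object names derived from the offload UUID and the ledger id,
   // following the naming described in this issue (not verified against DataBlockUtils).
   public final class OffloadKeyNames {
       public static String indexKey(UUID uuid, long ledgerId) {
           return String.format("%s-ledger-%d-index", uuid, ledgerId);
       }

       public static String dataKey(UUID uuid, long ledgerId) {
           return String.format("%s-ledger-%d-data", uuid, ledgerId);
       }
   }
   ```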
   
   Third, once the ledger is offloaded successfully, the offloader sets `complete` in the `LedgerInfo.OffloadContext` to `true` and then persists it into the metadata store. When the persist succeeds, it updates the in-memory ledgers map to the latest status. From then on, if a consumer wants to read the ledger, it will read from tiered storage.
   
   If any step fails, the offloader cleans up the offloaded files and runs the process again (a simplified sketch of the whole flow follows).
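
   A minimal, hypothetical sketch of that three-step flow, assuming simplified helper names (this is not the real `BlobStoreManagedLedgerOffloader` / managed-ledger code, just an illustration of the steps described above):

   ```java
   import java.util.UUID;

   // Simplified sketch of the three-step offload flow described above.
   // All class and method names are illustrative assumptions, not the real Pulsar API.
   public abstract class OffloadFlowSketch {

       public void offloadLedger(long ledgerId) throws Exception {
           // Step 1: prepare the metadata. Generate a UUID and persist the offload
           // context (UUID, driver name, bucket, ...) into the metadata store (ZooKeeper).
           UUID uuid = UUID.randomUUID();
           persistOffloadContext(ledgerId, uuid);

           try {
               // Step 2: copy the ledger data and index into cloud storage, using
               // object names derived from the UUID and the ledger id.
               writeLedgerToCloudStorage(ledgerId, uuid);

               // Step 3: set complete=true in LedgerInfo.OffloadContext, persist it to the
               // metadata store, then update the in-memory ledgers map so that consumers
               // read this ledger from tiered storage from now on.
               persistOffloadComplete(ledgerId, uuid);
               updateInMemoryLedgersMap(ledgerId);
           } catch (Exception e) {
               // Any failure: clean up the uploaded objects and let the offload be retried.
               // This cleanup becomes dangerous if the metadata write actually succeeded
               // on the server while the client only saw an error (see below).
               deleteOffloadedObjects(ledgerId, uuid);
               throw e;
           }
       }

       // Hypothetical helpers standing in for the real metadata-store and blob-store calls.
       protected abstract void persistOffloadContext(long ledgerId, UUID uuid) throws Exception;
       protected abstract void writeLedgerToCloudStorage(long ledgerId, UUID uuid) throws Exception;
       protected abstract void persistOffloadComplete(long ledgerId, UUID uuid) throws Exception;
       protected abstract void updateInMemoryLedgersMap(long ledgerId);
       protected abstract void deleteOffloadedObjects(long ledgerId, UUID uuid) throws Exception;
   }
   ```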
   
   ---
   
   From the logs, we found that ZooKeeper had not been healthy since 8:00. We then found that when the offload of this ledger ran the third step, it failed and threw an exception:
   
   `Caused by: org.apache.pulsar.metadata.api.MetadataStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /managed-ledgers/navision/product/persistent/ordered.product_category.navision_model.events-partition-0`

   After that, the Pulsar offloader could not send requests to ZooKeeper successfully. This caused the offload process to run again and again, and what we see in the logs is that the ledger keeps being offloaded but never succeeds.
   
   There is a case in which the client throws `ConnectionLoss` even though the request was sent to the server successfully: the connection was closed for some other reason and the client could not receive the response.
   
   This would explain what we see in the metadata: the offload `complete` flag is `true`, but we cannot find the data in S3.
   The metadata update succeeds on the server but an error is returned to the client, so our offloader cleans up the offloaded ledger. After that, the consumer cannot get the messages. The suspected sequence is sketched below.
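
   To make that suspected sequence concrete, here is a hypothetical sketch reusing the helper names from the flow sketch above (again an assumption for illustration, not the actual Pulsar code; the method would live in the same sketch class):

   ```java
   // Suspected failure sequence, illustration only (reuses the hypothetical helpers above).
   public void suspectedFailureSequence(long ledgerId, UUID uuid) throws Exception {
       try {
           // The znode update setting complete=true is applied by the ZooKeeper server...
           persistOffloadComplete(ledgerId, uuid);
       } catch (Exception connectionLoss) {
           // ...but the connection drops before the response arrives, so the client only
           // sees ConnectionLoss and treats the offload as failed.
           // The cleanup then deletes the freshly uploaded objects from S3, even though the
           // persisted metadata now says complete=true and still points at those objects.
           deleteOffloadedObjects(ledgerId, uuid);
           // Result: complete=true in LedgerInfo.OffloadContext, but no objects in S3.
           // A later read is routed to tiered storage and fails, matching the
           // NullPointerException in readOffloaded at the top of this issue.
       }
   }
   ```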
   
   
   
   
   
   
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

