fretory opened a new issue, #24436:
URL: https://github.com/apache/pulsar/issues/24436

   ### Search before reporting
   
   - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Read release policy
   
   - [x] I understand that [unsupported 
versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions)
 don't get bug fixes. I will attempt to reproduce the issue on a supported 
version of Pulsar client and Pulsar broker.
   
   
   ### User environment
   
   * **Broker Version**: 3.0.5
   * **Deployment**: Kubernetes with Docker
   
   **Problem Description**
   
   We have enabled the following features in our Pulsar cluster:
   
   ```yaml
   systemTopicEnabled: "true"
   topicLevelPoliciesEnabled: "true"
   
   managedLedgerDefaultAckQuorum: "2"
   managedLedgerDefaultEnsembleSize: "2"
   managedLedgerDefaultWriteQuorum: "2"
   ```
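   For context on why this topic matters: with `systemTopicEnabled` and `topicLevelPoliciesEnabled` set as above, every topic-level policy update is persisted as an event on the per-namespace `__change_events` system topic, which is why its ledgers sit on the critical path. A minimal sketch of exercising it (the tenant, namespace, and topic names here are placeholders, not from our cluster):
   
   ```shell
   # Any topic-level policy write appends an event to the namespace's
   # __change_events system topic (tenant/ns/my-topic is a placeholder).
   pulsar-admin topics set-retention persistent://tenant/ns/my-topic \
     --size 1G --time 3d
   
   # Inspect the ledgers backing the system topic itself.
   pulsar-admin topics stats-internal persistent://tenant/ns/__change_events
   ```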
   
   ### Issue Description
   
   After we configured some topic-level policies, the cluster went through several restarts and other routine operations over a period of time. We confirmed that no data in BookKeeper was modified manually during this period.
   
   Subsequently, we observed "Failed to read entries" errors on the system topic `__change_events`. This blocked the creation of both consumers and producers, rendering the topic service unavailable.
   
   
   ### Error messages
   
   ![Image](https://github.com/user-attachments/assets/b251fdaf-f4b0-4879-bb16-b91b8c82c99f)
   
   Unfortunately, we do not have comprehensive logs from the exact time of the incident. However, this issue has occurred multiple times recently. We will collect more detailed logs if it recurs.
   
   ### Reproducing the issue
   
   We do not have a reliable way to reproduce this issue. So far, we have observed that it **occurs with higher probability after host restarts.**
   
   
   ### Additional information
   
   ### **Workaround**
   
   To resolve this, we followed these steps:
   
   1.  **Deleted the broken ledger** on the corresponding BookKeeper (BK) node.
   2.  After the deletion, the broker may log "No such ledger exists on Metadata Server".
   3.  **Deleted the `/schemas/tenant/namespace/__change_events` entry** in 
ZooKeeper.
   4.  **Restarted the broker**.
   
   After performing these steps, the cluster recovered.
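   The steps above can be sketched as follows. This is only an illustration of what we ran, with placeholders filled in by hand: `<ledger-id>` stays elided, the ZooKeeper address and the broker StatefulSet name are assumptions specific to a deployment, and deleting a ledger is irreversible:
   
   ```shell
   # 1. Delete the broken ledger from BookKeeper (data in it is lost).
   #    <ledger-id> is the ledger reported in the "Failed to read entries" error.
   bookkeeper shell deleteledger -ledgerid <ledger-id> -force
   
   # 2. Remove the stale schema entry for the system topic from ZooKeeper
   #    (server address is a placeholder for the metadata store).
   zkCli.sh -server localhost:2181 \
     delete /schemas/tenant/namespace/__change_events
   
   # 3. Restart the broker; in this Kubernetes deployment, e.g.
   #    (StatefulSet name is hypothetical):
   kubectl rollout restart statefulset/pulsar-broker
   ```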
   
   ### **Questions**
   
   1.  What could be the **root cause** for the ledger of the `__change_events` 
system topic becoming corrupted?
   2.  It seems **unreasonable** that a broken ledger in `__change_events` 
leads to the entire topic service becoming unavailable. Could there be an 
enhancement to detect such ledger corruption and **automatically 
reload/recreate** the necessary components to prevent service disruption?
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
