conor-nsurely opened a new issue, #23890: URL: https://github.com/apache/pulsar/issues/23890
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Read release policy

- [x] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

### Version

Pulsar 4.0.1
Not OS specific
We use the pulsar-go client, but it is not a factor here

### Minimal reproduce step

I am not sure how to reproduce this. I suspect it occurred when there were multiple restarts across the Pulsar cluster (bookies, brokers, ZooKeeper). The restarts were not caused by Pulsar, but by scaling nodes up and down.

### What did you expect to see?

autoSkipNonRecoverableData is set to true, so I had expected the broker to ignore the ledger errors and start up successfully.

### What did you see instead?

The broker(s) crash when trying to start up, and the cluster is down.

From the broker.conf

```
# Skip reading non-recoverable/unreadable data-ledger under managed-ledger's list. It helps when data-ledgers gets
# corrupted at bookkeeper and managed-cursor is stuck at that ledger.
autoSkipNonRecoverableData=true
```

Here are some of the errors I am seeing

`pulsar-broker org.apache.bookkeeper.mledger.ManagedLedgerException: Error while reading ledger error code: -1
pulsar-broker 2025-01-22T12:52:24,865+0000 [broker-topic-workers-OrderedExecutor-0-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://public/functions/metadata / c-pulsar-fw-pulsar-broker-1.pulsar-broker.default.svc.cluster.local-8080-function-metadata-tailer-reader-c968c95506-Consumer{subscription=PersistentSubscription{topic=persistent://public/functions/metadata, name=c-pulsar-fw-pulsar-broker-1.pulsar-broker.default.svc.cluster.local-8080-function-metadata-tailer-reader-c968c95506}, consumerId=1, consumerName=c-pulsar-fw-pulsar-broker-1.pulsar-broker.default.svc.cluster.local-8080-function-metadata-tailer, address=[id: 0xb55fc6c8, L:/10.196.5.38:6650 - R:/10.196.5.38:33450] [SR:10.196.5.3, state:Connected[]}] Error reading entries at 1508619:54 : Error while reading ledger error code: -1 - Retrying to read in 54.316 seconds`

`pulsar-broker 2025-01-22T12:54:38,036+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO org.apache.bookkeeper.client.ReadOpBase - Error: Error while reading ledger while reading L1533609 E0 from bookie: pulsar-bookie-1.pulsar-bookie.default.svc.cluster.local:3181`

`2025-01-22T12:56:59,600+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L1533609 E0-E0, Sent to [pulsar-bookie-1.pulsar-bookie.default.svc.cluster.local:3181], Heard from [] : bitset = {}, Error = 'Error while reading ledger'. First unread entry is (-1, rc = null)`

### Anything else?

Based on another issue, I deleted one or two ledgers mentioned in the logs to see if that would make a difference.
However, I didn't keep deleting, as I wish to find a better solution in case this happens in our production environments.

Since the error messages refer to bookie-1, I tried scaling down the cluster to 1 broker, 1 bookie and 1 ZooKeeper. This did not resolve the issue.

Occasionally I have seen the ledger error code be -10, rather than -1.

Thanks in advance for any advice

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
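
To see which ledgers the read errors point at without picking IDs out of the logs by hand, something like the following could be used. It is only a sketch: the two regular expressions are assumptions based solely on the log excerpts quoted above, and `failing_ledgers` is a hypothetical helper name, not part of Pulsar or BookKeeper.

```python
import re
from collections import Counter

# Both patterns are assumptions derived from the log excerpts in this issue.
# "L1533609 E0" style, as printed by the BookKeeper client (ReadOpBase / PendingReadOp).
BK_READ = re.compile(r"\bL(\d+) E(\d+)")
# "Error reading entries at 1508619:54" style, as printed by the broker dispatcher.
DISPATCHER_READ = re.compile(r"Error reading entries at (\d+):(\d+)")


def failing_ledgers(lines):
    """Count how often each ledger ID appears in read-failure log lines."""
    counts = Counter()
    for line in lines:
        for pattern in (BK_READ, DISPATCHER_READ):
            for ledger_id, _entry_id in pattern.findall(line):
                counts[int(ledger_id)] += 1
    return counts
```

Passing it the lines of a saved broker log (with any terminal line wrapping removed first) returns a `Counter` keyed by ledger ID, which gives a rough picture of how many distinct ledgers are affected.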
