Shawyeok opened a new issue #13693: URL: https://github.com/apache/pulsar/issues/13693
**Describe the bug** In our staging cluster, some bookie instances keep logging `No ledger found while reading entry: xx from ledger: xxxx`, I did some investigation and found the internal metadata of `LedgerHandle`(org.apache.bookkeeper.client.LedgerHandle#versionedMetadata) of specific ledgers are inconsistent with latest ledger metadata in zookeeper. **Screenshots** Bookie logs:  On that bookie, it says the `readEntry` request is from `172.30.10.3:48336` which is a broker instance. ```shell [arthas@2278]$ watch org.apache.bookkeeper.proto.ReadEntryProcessorV3 getReadResponse {target.ledgerId,target.entryId,target.channel,returnObj.status} returnObj.status_!=0 Press Q or Ctrl+C to abort. Affect(class count: 2 , method count: 1) cost in 225 ms, listenerId: 1 method=org.apache.bookkeeper.proto.ReadEntryProcessorV3.getReadResponse location=AtExit ts=2022-01-10 14:07:30; [cost=1.716101ms] result=@ArrayList[ @Long[4976503], @Long[0], @EpollSocketChannel[[id: 0xce78ecda, L:/172.30.10.2:3181 - R:/172.30.10.3:48336]], @StatusCode[ENOLEDGER], ] ``` On broker instance `172.30.10.3:8080`, it says `ensembles` are `172.30.10.5:3181, 172.30.10.2:3181`: ```shell [arthas@2045]$ watch org.apache.bookkeeper.client.LedgerHandle asyncReadEntries {target.ledgerId,target.versionedMetadata} target.ledgerId==4976503 Press Q or Ctrl+C to abort. Affect(class count: 3 , method count: 1) cost in 289 ms, listenerId: 2 method=org.apache.bookkeeper.client.LedgerHandle.asyncReadEntries location=AtExit ts=2022-01-10 15:27:59; [cost=0.094076ms] result=@ArrayList[ @Long[4976503], @Versioned[Versioned(value=LedgerMetadata{formatVersion=3, ensembleSize=2, writeQuorumSize=2, ackQuorumSize=2, state=CLOSED, length=300, lastEntryId=0, digestType=CRC32, password=base64:, ensembles={0=[172.30.10.5:3181, 172.30.10.2:3181]}, customMetadata={pulsar/compactedTopic=base64:cGVyc2lzdGVudDovL2NhbXB1cy9ic2l0ZS9fX2NoYW5nZV9ldmVudHMtcGFydGl0aW9uLTE=, component=base64:Y29tcGFjdGVkLWxlZGdlcg==, application=base64:cHVsc2Fy, pulsar/compactedTo=base64:CNnerwIQARgBMAA=}}, version=1)], ] ``` But the latest ledger metadata shows `ensembles` are `172.30.10.4:3181, 172.30.10.3:3181`: ```shell # curl -sS 'http://172.30.10.2:8000/api/v1/ledger/metadata?ledger_id=4976503' { "4976503" : { "ledgerId" : 4976503, "metadataFormatVersion" : 3, "ensembleSize" : 2, "writeQuorumSize" : 2, "ackQuorumSize" : 2, "state" : "CLOSED", "length" : 300, "lastEntryId" : 0, "ctime" : 1640845770828, "digestType" : "CRC32", "password" : "", "customMetadata" : { "pulsar/compactedTopic" : "cGVyc2lzdGVudDovL2NhbXB1cy9ic2l0ZS9fX2NoYW5nZV9ldmVudHMtcGFydGl0aW9uLTE=", "component" : "Y29tcGFjdGVkLWxlZGdlcg==", "application" : "cHVsc2Fy", "pulsar/compactedTo" : "CNnerwIQARgBMAA=" }, "closed" : true, "allEnsembles" : { "0" : [ { "id" : "172.30.10.4:3181" }, { "id" : "172.30.10.3:3181" } ] }, "ctoken" : 0 } } ``` So the broker keep sent readEntry requests to wrong bookies which are not in ledger ensemble list. **Desktop (please complete the following information):** - OS: ```shell # cat /etc/redhat-release CentOS Linux release 7.8.2003 (Core) ``` **Additional context** Bookie auto-recovery is **enabled**. Machine `172.30.10.2` has crashed once at `2022-01-03 17:40:08 +08:00`, and bookie instance started at `2022-01-03 19:35:08 +08:00`, `No ledger found` log start shows after `172.30.10.2` crashed. Possible related https://github.com/apache/pulsar/issues/7214 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
