horizonzy opened a new issue, #21058: URL: https://github.com/apache/pulsar/issues/21058
### Search before asking - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar. ### Version The newest master. ### Minimal reproduce step 1. Start AutoRecovery using `bin/bookkeeper recovery` 2. Stop one bookie ### What did you expect to see? ### Superficial cause ``` 2023-08-21T13:10:47,207-0400 [ReplicationWorker] DEBUG org.apache.bookkeeper.replication.ReplicationWorker - Going to replicate the fragments of the ledger: 36955355 2023-08-21T13:10:48,889-0400 [ReplicationWorker] DEBUG org.apache.bookkeeper.replication.ReplicationWorker - Founds fragments [Fragment(LedgerID: 36955355, FirstEntryID: 0[0], LastKnownEntryID: 735[735], Host: [bookie-host.com:3181], Closed: true)] for replication from ledger: 36955355 2023-08-21T13:10:48,911-0400 [ReplicationWorker] INFO org.apache.bookkeeper.client.LedgerFragmentReplicator - Replicating fragment Fragment(LedgerID: 36955355, FirstEntryID: 0[0], LastKnownEntryID: 735[735], Host: [bookie-host.com:3181], Closed: true) in 1 sub fragments. 2023-08-21T13:10:49,393-0400 [main-EventThread] INFO org.apache.bookkeeper.client.LedgerFragmentReplicator - Updated ZK to point ledger fragments from old bookies to new bookies: {bookie-host.com:3181=bookie-host1.com:3181} 2023-08-21T13:10:49,401-0400 [ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - Ledger replicated successfully. ledger id is: 36955355 2023-08-21T13:10:49,401-0400 [ReplicationWorker] DEBUG org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager - markLedgerReplicated(ledgerId=36955355) 2023-08-21T13:10:49,421-0400 [ReplicationWorker] DEBUG org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager - releaseLedger(ledgerId=36955355) 2023-08-21T13:10:49,430-0400 [ReplicationWorker] DEBUG org.apache.pulsar.metadata.bookkeeper.PulsarLedgerManager - Unregistered ledger metadata listener ReadOnlyLedgerHandle(lid = 36955355, id = 1009584614) on ledger 36955355. 2023-08-21T13:10:49,431-0400 [ReplicationWorker] DEBUG org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager - releaseLedger(ledgerId=36955355) 2023-08-21T13:10:49,431-0400 [ReplicationWorker] DEBUG org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager - getLedgerToRereplicate() 2023-08-21T13:10:49,431-0400 [ReplicationWorker] DEBUG org.apache.pulsar.metadata.bookkeeper.PulsarLedgerUnderreplicationManager - isLedgerReplicationEnabled() 2023-08-21T13:17:15,122-0400 [ReplicationWorker] DEBUG org.apache.bookkeeper.replication.ReplicationWorker - Going to replicate the fragments of the ledger: 23993683 2023-08-21T13:17:15,167-0400 [ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - Ledger replicated successfully. ledger id is: 23993683 ``` We can see the ledger `36955355` was completed quickly, it only took about 2 seconds (13:10:47-13:10:49). After `36955355` completing, the ReplicationWorker will get the next underReplicated ledger id from the metadata store. From the logs, it takes about 6 minutes; this means the method `PulsarLedgerUnderreplicationManager#getLedgerToRereplicate` takes about 6 minutes. in `PulsarLedgerUnderreplicationManager#getLedgerToRereplicate`, it will iterate the child nodes under `/ledgers/underreplication/ledgers`. And it only picks the node whose depth is 5, like `/ledgers/underreplication/ledgers/0000/0000/0000/4441/urL0000017473`. If the path only has 4 depths, like `/ledgers/underreplication/ledgers/0000/0000/0000/4442`, the path will be ignored, then pick the next path. Example: There are 4 nodes. ``` /ledgers/underreplication/ledgers/0000/0000/0000/4441/urL0000017473 /ledgers/underreplication/ledgers/0000/0000/0000/45c8/urL0000017864 /ledgers/underreplication/ledgers/0000/0000/0000/4785 /ledgers/underreplication/ledgers/0000/0000/0000/4778 ``` When picking `/ledgers/underreplication/ledgers/0000/0000/0000/4785`(The pick algorithm is random), it found that it only has 4 depths ('/0000/0000/0000/4785'), the node will be ignored. Then pick the next node `/ledgers/underreplication/ledgers/0000/0000/0000/4441`, and it found that it has 5 depths(`/0000/0000/0000/4441/urL0000017473`), then return the ledger id `17473`. We found that in the customer's cluster, it has too many invalid nodes with a depth of 4. So the replicationWorker iterates too many invalid nodes, and the iteration takes a long time. ### Root cause > Why are there so many invalid nodes in the client's cluster? Finally, we found `PulsarLedgerUnderreplicationManager` introduces the problem, after `PulsarLedgerUnderreplicationManager` completes the ledger replication, it will mark the ledger as replicated. The mark operation will delete the node `/ledgers/underreplication/ledgers/0000/0000/0000/4441/urL0000017473`, then leave the node `/ledgers/underreplication/ledgers/0000/0000/0000/4441/`, so went an invalid node. https://github.com/apache/pulsar/blob/30073dbac0e941869b43e090d2682935e8f094e5/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarLedgerUnderreplicationManager.java#L393-L422 In `ZkLedgerUnderreplicationManager`, after delete the node `/ledgers/underreplication/ledgers/0000/0000/0000/4441/urL0000017473`, it checks if the child nodes are empty, and if they are, it deletes the parent node, so the node `/ledgers/underreplication/ledgers/0000/0000/0000/4441` will also be removed, and leave the node `/ledgers/underreplication/ledgers/0000/0000/0000`. https://github.com/apache/bookkeeper/blob/5b5c05331757e7356579076970e61f119f5d34ae/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java#L391-L436 _**PulsarLedgerUnderreplicationManager is at pulsar side, when the user config -`Dbookkeeper.metadata.client.drivers=org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver`, the ReplicationWorker will use it. ZkLedgerUnderreplicationManager is at the bk side; it's the default.**_ ### What did you see instead? PulsarLedgerUnderreplicationManager should also delete the empty parent nodes. ### Anything else? _No response_ ### Are you willing to submit a PR? - [X] I'm willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
