[I] Racing Condition between Admin Recovery and Over-Replicated Ledger GC (on Destination BK) [bookkeeper]

via GitHub Thu, 25 Sep 2025 16:28:57 -0700


heesung-sohn opened a new issue, #4668:
URL: https://github.com/apache/bookkeeper/issues/4668


   **BUG REPORT**
   
   ***Describe the bug***
   bk recover command throws BKLedgerFencedException.
   
   bin/bookkeeper shell recover 
pulsar-mini-bookie-4.pulsar-mini-bookie.pulsar.svc.cluster.local:3181 
   
   ```
   bookie:pulsar-mini-bookie-2.pulsar-mini-bookie.pulsar.svc.cluster.local:3181 
rc:EOK, ledger:20:entry:110:entryLength:23886
   2025-09-25T20:09:23,397+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] 
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Add response 
from 
bookie:pulsar-mini-bookie-0.pulsar-mini-bookie.pulsar.svc.cluster.local:3181 
rc:EFENCED, ledger:7:entry:3900
   2025-09-25T20:09:23,397+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] 
ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error writing 
entry for ledgerId: 7, entryId: 3900, bookie: 
pulsar-mini-bookie-0.pulsar-mini-bookie.pulsar.svc.cluster.local:3181
   org.apache.bookkeeper.client.BKException$BKLedgerFencedException: Ledger has 
been fenced off. Some other client must have opened it to read
        at org.apache.bookkeeper.client.BKException.create(BKException.java:99) 
~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
        at 
org.apache.bookkeeper.client.LedgerFragmentReplicator$2.writeComplete(LedgerFragmentReplicator.java:357)
 ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
        at 
org.apache.bookkeeper.proto.PerChannelBookieClient$AddCompletion.writeComplete(PerChannelBookieClient.java:2200)
 ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
        at 
org.apache.bookkeeper.proto.PerChannelBookieClient$AddCompletion.handleResponse(PerChannelBookieClient.java:2257)
 ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
        at 
org.apache.bookkeeper.proto.PerChannelBookieClient$AddCompletion.handleV2Response(PerChannelBookieClient.java:2236)
 ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
        at 
org.apache.bookkeeper.proto.PerChannelBookieClient$ReadV2ResponseCallback.run(PerChannelBookieClient.java:1407)
 ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
        at 
org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
 ~[org.apache.bookkeeper-bookkeeper-common-4.16.7.jar:4.16.7]
        at 
org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107)
 ~[org.apache.bookkeeper-bookkeeper-common-4.16.7.jar:4.16.7]
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 ~[io.netty-netty-common-4.1.122.Final.jar:4.1.122.Final]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
   ```
   
   This ledger fenced exception during recovery can be caused by a racing 
condition between admin ledger recovery and bk’s over-replicated-ledger gc. If 
admin ledger recovery is slow, and if the over-replicated-ledger gc happens to 
clean recovering ledgers, this fenced exception could hit and further block the 
recovery.
   
   
   To mitigate the issue, we could restart the destination bookies. Also, we 
could consider disabling this over-replicated-ledger gc during the decommission 
process(by gcOverreplicatedLedgerWaitTime=-1).
   
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   
   pulsar cluster with e.g. 5 bookies, ES=WQ=2 with the following configs.
   ```
   bookkeeper:
     podMonitor:
       enabled: false
     replicaCount: 5
     configData:
       # minimal memory use for bookkeeper
       # 
https://bookkeeper.apache.org/docs/reference/config#db-ledger-storage-settings
       dbStorage_writeCacheMaxSizeMb: "32"
       dbStorage_readAheadCacheMaxSizeMb: "32"
       dbStorage_rocksDB_writeBufferSizeMB: "8"
       dbStorage_rocksDB_blockCacheSize: "8388608"
       httpServerEnabled: "true"
       gcWaitTime: "60000"
       gcOverreplicatedLedgerWaitTime: "30000"
       openLedgerRereplicationGracePeriod: "100"
   
   broker:
     podMonitor:
       enabled: false
     replicaCount: 2
     configData:
       ## Enable `autoSkipNonRecoverableData` since bookkeeper is running
       ## without persistence
       autoSkipNonRecoverableData: "true"
       # storage settings
       managedLedgerDefaultEnsembleSize: "2"
       managedLedgerDefaultWriteQuorum: "2"
       managedLedgerDefaultAckQuorum: "2"
       managedLedgerMaxEntriesPerLedger: "50"
       managedLedgerMinLedgerRolloverTimeMinutes: "1"
   ```
   
   
   ```
   # producer
   bin/pulsar-admin tenants create my-tenant
   bin/pulsar-admin namespaces create my-tenant/my-ns
   bin/pulsar-admin namespaces set-retention my-tenant/my-ns --size -1 --time -1
   bin/pulsar-admin topics create-partitioned-topic 
persistent://my-tenant/my-ns/test-topic2 -p 10
   bin/pulsar-admin topics list my-tenant/my-ns
   bin/pulsar-admin namespaces get-retention my-tenant/my-ns
   bin/pulsar-perf produce persistent://my-tenant/my-ns/test-topic2 -r 10000
   
   
   # recover command on bk-4
   curl -X PUT -H "Content-Type: application/json" -d '{"readOnly": true}' 
http://localhost:8000/api/v1/bookie/state/readonly
   bin/bookkeeper shell recover 
pulsar-mini-bookie-4.pulsar-mini-bookie.pulsar.svc.cluster.local:3181  -rate 
1000000
   ```
   
   This will simulate this case:
   
   t0: L1.metadata = F1:[bk1,bk3], F2:[bk1,bk3]. L1 is replicated to bk1 and 
bk3 and has two fragments F1 and F2.
   t1: admin recover started with src bk3(to decommission bk3).
   t2: admin recover copied L1:F1 to bk2.
   t3: bk2 over-replicated cleanup kicked-in (timeout by 
gcOverreplicatedLedgerWaitTime)
   t4: bk2 sees L1 in its local active ledger list but the L1.metadata still 
shows L1.F1 ensemble is [bk1,bk3] (recovery ledger copy and metadata update are 
NOT transactional).
   t5: bk2 removed L1 locally as it thinks that L1 is over-replicated on bk2.
   t6: bk2 put L1 in recentlyFencedAndDeletedLedgers at 
markIfConflictWritingOccurs (ledgerDescriptor.isFenced() throws exception 
because [we remove the ledger index 
first.](https://github.com/apache/bookkeeper/blob/branch-4.17/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/SingleDirectoryDbLedgerStorage.java#L938)
 )
   t7: the admin recover continued and tried to copy L1:F2 to bk2. It hit the 
ledger fenced exception from recentlyFencedAndDeletedLedgers check(L1).
   
   
   
   ***Expected behavior***
   The bk recover command finishes without exception.
   
   
   ***Screenshots***
   
   
   ***Additional context***
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Racing Condition between Admin Recovery and Over-Replicated Ledger GC (on Destination BK) [bookkeeper]

Reply via email to