poorbarcode opened a new pull request, #19444:
URL: https://github.com/apache/pulsar/pull/19444

   ### Motivation
   
   Unstable ZK connections can cause these conditions to occur: 
   1. add entry timeout(because the connections of BK to ZK is also unstable)
   1. write ZK fail when ledger info update
   1. delete ledger is delayed(because delete meta is delayed)
   1. and so on
   
   When cases `1,2,3` occur at the same time, a ledger that is in use may be 
deleted. for example:
   - add entry(`3:0`) timeout(eventually successful)
     - the field `lastConfirmedPosition` of the current ledger is still `-1`
     - the ledger info of ML is `[3]` now.
   - switch ledger caused by add entry fail, but write ZK is also failure.
     - delete ledger `3` async (because `lastConfirmedPosition == -1`), and the 
operation is delayed.
   - topic reload caused by lost ZK connection, close old ML and create new
   - execute `ml.recover`
     - ledger `3` is not removed during recovery because it contains data 
     - create new ledger `4`
     - the ledger info of ML is `[3,4]` now
   - the operation of `delete ledger 3` finished.
   - an error occurs: the deleted ledger `3` is still in use, and the topic 
load will fail by the ledger is lost.
   
   You can reproduce this issue by `testLedgerInfoMetaCorrectIfAddEntryTimeOut`.
   
   ----
   
   Below is a log of when an error occurs in our online environment:
   
   ##### 1. Topic load is fail
   ```
   2022-12-02T22:33:44.115106032-05:00 stdout F 2022-12-03T03:33:44,115+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN  
org.apache.bookkeeper.mledger.impl.OpReadEntry - 
[xes/etl/persistent/stu_order][pulsar.dedup] read failed from ledger at 
position:21347:0
   2022-12-02T22:33:44.115018833-05:00 stdout F 2022-12-03T03:33:44,114+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR 
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[xes/etl/persistent/stu_order] Error opening ledger for reading at position 
21347:0 - 
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException:
 No such ledger exists on Metadata Server
   2022-12-02T22:33:44.454765118-05:00 stdout F 2022-12-03T03:33:44,454+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR 
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[xes/etl/persistent/stu_order] Error opening ledger for reading at position 
21347:0 - 
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException:
 No such ledger exists on Metadata Server
   2022-12-02T22:33:44.454865574-05:00 stdout F 2022-12-03T03:33:44,454+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN  
org.apache.bookkeeper.mledger.impl.OpReadEntry - 
[xes/etl/persistent/stu_order][pulsar.dedup] read failed from ledger at 
position:21347:0
   2022-12-02T22:33:44.957770974-05:00 stdout F 2022-12-03T03:33:44,957+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR 
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[xes/etl/persistent/stu_order] Error opening ledger for reading at position 
21347:0 - 
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException:
 No such ledger exists on Metadata Server
   2022-12-02T22:33:44.957880879-05:00 stdout F 2022-12-03T03:33:44,957+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN  
org.apache.bookkeeper.mledger.impl.OpReadEntry - 
[xes/etl/persistent/stu_order][pulsar.dedup] read failed from ledger at 
position:21347:0
   2022-12-02T22:33:45.384780673-05:00 stdout F 2022-12-03T03:33:45,384+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR 
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[xes/etl/persistent/stu_order] Error opening ledger for reading at position 
21347:0 - 
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException:
 No such ledger exists on Metadata Server
   2022-12-02T22:33:45.384874832-05:00 stdout F 2022-12-03T03:33:45,384+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN  
org.apache.bookkeeper.mledger.impl.OpReadEntry - 
[xes/etl/persistent/stu_order][pulsar.dedup] read failed from ledger at 
position:21347:0
   ```
   
   ##### 2. The connection of ZK is unstable
   ```
   # pulsar-broker-4
   2022-12-02T22:23:42.659683926-05:00 stdout F 2022-12-03T03:23:42,659+0000 
[metadata-store-zk-session-watcher-11-1] INFO  
org.apache.pulsar.broker.PulsarService - Received metadata service session 
event: ConnectionLost
   # pulsar-broker-4
   2022-12-02T22:23:42.660145113-05:00 stdout F 2022-12-03T03:23:42,659+0000 
[metadata-store-zk-session-watcher-11-1] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received 
MetadataStore session event: ConnectionLost 
   # pulsar-broker-0
   2022-12-02T22:23:48.738214552-05:00 stdout F 2022-12-03T03:23:48,738+0000 
[metadata-store-zk-session-watcher-11-1] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received 
MetadataStore session event: ConnectionLost
   # pulsar-broker-0
   2022-12-02T22:23:48.73816115-05:00 stdout F 2022-12-03T03:23:48,738+0000 
[metadata-store-zk-session-watcher-11-1] INFO  
org.apache.pulsar.broker.PulsarService - Received metadata service session 
event: ConnectionLost
   # pulsar-broker-4
   2022-12-02T22:24:08.660979585-05:00 stdout F 2022-12-03T03:24:08,660+0000 
[metadata-store-zk-session-watcher-11-1] INFO  
org.apache.pulsar.broker.PulsarService - Received metadata service session 
event: SessionLost
   # pulsar-broker-4
   2022-12-02T22:24:08.660989559-05:00 stdout F 2022-12-03T03:24:08,660+0000 
[metadata-store-zk-session-watcher-11-1] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received 
MetadataStore session event: SessionLost
   # pulsar-broker-0
   2022-12-02T22:24:14.703831204-05:00 stdout F 2022-12-03T03:24:14,703+0000 
[metadata-store-zk-session-watcher-11-1] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received 
MetadataStore session event: SessionLost
   # pulsar-broker-0
   2022-12-02T22:24:14.7038165-05:00 stdout F 2022-12-03T03:24:14,703+0000 
[metadata-store-zk-session-watcher-11-1] INFO  
org.apache.pulsar.broker.PulsarService - Received metadata service session 
event: SessionLost
   # pulsar-broker-4
   2022-12-02T22:25:16.667165767-05:00 stdout F 2022-12-03T03:25:16,666+0000 
[ZKC-connect-executor-0-EventThread] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received 
MetadataStore session event: Reconnected
   # pulsar-broker-4
   2022-12-02T22:25:16.667042947-05:00 stdout F 2022-12-03T03:25:16,666+0000 
[ZKC-connect-executor-0-EventThread] INFO  
org.apache.pulsar.broker.PulsarService - Received metadata service session 
event: Reconnected
   # pulsar-broker-4
   2022-12-02T22:25:16.681716421-05:00 stdout F 2022-12-03T03:25:16,681+0000 
[ZKC-connect-executor-0-EventThread] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received 
MetadataStore session event: SessionReestablished
   # pulsar-broker-4
   2022-12-02T22:25:16.681441416-05:00 stdout F 2022-12-03T03:25:16,681+0000 
[ZKC-connect-executor-0-EventThread] INFO  
org.apache.pulsar.broker.PulsarService - Received metadata service session 
event: SessionReestablished
   # pulsar-broker-4
   2022-12-02T22:25:16.761244318-05:00 stdout F 2022-12-03T03:25:16,760+0000 
[metadata-store-coordination-service-12-1] WARN  
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Failure when 
processing session event
   # pulsar-broker-0
   2022-12-02T22:33:24.741380081-05:00 stdout F 2022-12-03T03:33:24,741+0000 
[main-EventThread] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received 
MetadataStore session event: Reconnected
   # pulsar-broker-0
   2022-12-02T22:33:24.74137549-05:00 stdout F 2022-12-03T03:33:24,741+0000 
[main-EventThread] INFO  org.apache.pulsar.broker.PulsarService - Received 
metadata service session event: Reconnected
   # pulsar-broker-0
   2022-12-02T22:33:24.755746638-05:00 stdout F 2022-12-03T03:33:24,754+0000 
[main-EventThread] INFO  org.apache.pulsar.broker.PulsarService - Received 
metadata service session event: SessionReestablished
   # pulsar-broker-0
   2022-12-02T22:33:24.755772892-05:00 stdout F 2022-12-03T03:33:24,754+0000 
[main-EventThread] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received 
MetadataStore session event: SessionReestablished
   ```
   
   ##### 3.The Topic switches brokers frequently
   ```
   #  pulsar-broker-4
   2022-12-02T22:21:01.15839105-05:00 stdout F 2022-12-03T03:21:01,157+0000 
[bookkeeper-ml-scheduler-OrderedScheduler-1-0] ERROR 
org.apache.pulsar.broker.service.persistent.PersistentTopic - 
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding 
anyway. 
   2022-12-02T22:21:01.679585396-05:00 stdout F 2022-12-03T03:21:01,679+0000 
[ZKC-connect-executor-0-EventThread] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger 
xes/etl/persistent/stu_order
   # pulsar-broker-0
   2022-12-02T22:21:35.923492915-05:00 stdout F 2022-12-03T03:21:35,923+0000 
[bookkeeper-ml-scheduler-OrderedScheduler-1-0] ERROR 
org.apache.pulsar.broker.service.persistent.PersistentTopic - 
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding 
anyway. 
   2022-12-02T22:21:36.252018422-05:00 stdout F 2022-12-03T03:21:36,251+0000 
[pulsar-io-4-3] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
Opening managed ledger xes/etl/persistent/stu_order
   # pulsar-broker-4
   2022-12-02T22:23:11.159764484-05:00 stdout F 2022-12-03T03:23:11,159+0000 
[pulsar-io-4-1] ERROR 
org.apache.pulsar.broker.service.persistent.PersistentTopic - 
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding 
anyway. 
   2022-12-02T22:23:11.333602486-05:00 stdout F 2022-12-03T03:23:11,333+0000 
[pulsar-io-4-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
Opening managed ledger xes/etl/persistent/stu_order
   # pulsar-broker-0
   2022-12-02T22:23:12.962155931-05:00 stdout F 2022-12-03T03:23:12,961+0000 
[pulsar-io-4-3] ERROR 
org.apache.pulsar.broker.service.persistent.PersistentTopic - 
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding 
anyway. 
   2022-12-02T22:23:13.20731387-05:00 stdout F 2022-12-03T03:23:13,207+0000 
[pulsar-io-4-3] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
Opening managed ledger xes/etl/persistent/stu_order
   #  pulsar-broker-4
   2022-12-02T22:26:35.030681434-05:00 stdout F 2022-12-03T03:26:35,030+0000 
[pulsar-io-4-2] ERROR 
org.apache.pulsar.broker.service.persistent.PersistentTopic - 
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding 
anyway. 
   2022-12-02T22:27:01.794766318-05:00 stdout F 2022-12-03T03:27:01,794+0000 
[pulsar-io-4-2] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
Opening managed ledger xes/etl/persistent/stu_order
   # pulsar-broker-0
   2022-12-02T22:33:24.867254146-05:00 stdout F 2022-12-03T03:33:24,866+0000 
[pulsar-io-4-1] ERROR 
org.apache.pulsar.broker.service.persistent.PersistentTopic - 
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding 
anyway. 
   2022-12-02T22:33:25.097914184-05:00 stdout F 2022-12-03T03:33:25,097+0000 
[main-EventThread] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
Opening managed ledger xes/etl/persistent/stu_order
   #  pulsar-broker-4
   2022-12-02T22:36:01.570683778-05:00 stdout F 2022-12-03T03:36:01,570+0000 
[pulsar-io-4-4] ERROR 
org.apache.pulsar.broker.service.persistent.PersistentTopic - 
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding 
anyway. 
   2022-12-02T22:36:01.755647415-05:00 stdout F 2022-12-03T03:36:01,755+0000 
[pulsar-io-4-4] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
Opening managed ledger xes/etl/persistent/stu_order
   2022-12-02T22:38:47.212421358-05:00 stdout F 2022-12-03T03:38:47,212+0000 
[ZKC-connect-executor-0-EventThread] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger 
public/default/persistent/__sla
   ```
   
   ##### 4. Ledger-21347's life cycle
   ```
   2022-12-02T22:23:13.250490551-05:00 stdout F 2022-12-03T03:23:13,250+0000 
[main-EventThread] INFO  org.apache.bookkeeper.client.LedgerCreateOp - 
Ensemble: [pulsar-bookie-1.pulsar-bookie.pulsar.svc.cluster.local:3181, 
pulsar-bookie-4.pulsar-bookie.pulsar.svc.cluster.local:3181] for ledger: 21347
   2022-12-02T22:23:13.250647455-05:00 stdout F 2022-12-03T03:23:13,250+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[xes/etl/persistent/stu_order] Created ledger 21347
   2022-12-02T22:24:14.296700424-05:00 stdout F 2022-12-03T03:24:14,296+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN  
org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (21347, 0): 
Bookie operation timeout
   2022-12-02T22:27:01.888372507-05:00 stdout F 2022-12-03T03:27:01,888+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] INFO  
org.apache.bookkeeper.client.ReadOnlyLedgerHandle - Closing recovered ledger 
21347 at entry 0
   2022-12-02T22:33:24.742114702-05:00 stdout F 2022-12-03T03:33:24,742+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] INFO  
org.apache.bookkeeper.client.LedgerHandle - New Ensemble: 
[pulsar-bookie-0.pulsar-bookie.pulsar.svc.cluster.local:3181, 
pulsar-bookie-4.pulsar-bookie.pulsar.svc.cluster.local:3181] for ledger: 21347
   2022-12-02T22:33:24.743599829-05:00 stdout F 2022-12-03T03:33:24,743+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR 
org.apache.bookkeeper.client.PendingAddOp - Write of ledger entry to quorum 
failed: L21347 E0
   2022-12-02T22:33:24.743596141-05:00 stdout F 2022-12-03T03:33:24,743+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR 
org.apache.bookkeeper.client.LedgerHandle - Closing ledger 21347 due to 
LedgerFencedException: Ledger has been fenced off. Some other client must have 
opened it to read
   2022-12-02T22:33:24.743589628-05:00 stdout F 2022-12-03T03:33:24,743+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN  
org.apache.bookkeeper.client.PendingAddOp - Fencing exception on write: L21347 
E0 on pulsar-bookie-0.pulsar-bookie.pulsar.svc.cluster.local:3181
   2022-12-02T22:33:24.772910842-05:00 stdout F 2022-12-03T03:33:24,772+0000 
[main-EventThread] INFO  org.apache.bookkeeper.client.MetadataUpdateLoop - 
UpdateLoop(ledgerId=21347,loopId=31b528bd) conflict writing metadata to store, 
update local value and try again
   2022-12-02T22:33:24.787555553-05:00 stdout F 2022-12-03T03:33:24,787+0000 
[main-EventThread] ERROR org.apache.bookkeeper.client.LedgerHandle - Metadata 
conflict when closing ledger 21347. Another client may have recovered the 
ledger while there were writes outstanding. (local lastEntry:-1 length:0)  
(metadata lastEntry:0 length:815) 
   2022-12-02T22:33:24.788055839-05:00 stdout F 2022-12-03T03:33:24,787+0000 
[main-EventThread] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - 
UpdateLoop(ledgerId=21347,loopId=31b528bd) Exception updating
   2022-12-02T22:33:24.802360201-05:00 stdout F 2022-12-03T03:33:24,802+0000 
[BookKeeperClientWorker-OrderedExecutor-1-0] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[xes/etl/persistent/stu_order] Delete complete for empty ledger 2134. rc=0
   ```
    
   ### Modifications
   Only ledgers removed from the meta of ledger info can be deleted from the BK.
   
   ### Documentation
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [ ] `doc` <!-- Your PR contains doc changes. -->
   - [ ] `doc-required` <!-- Your PR changes impact docs and you will update 
later -->
   - [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
   - [ ] `doc-complete` <!-- Docs have been already added -->
   
   ### Matching PR in forked repository
   
   PR in forked repository: 
   - https://github.com/poorbarcode/pulsar/pull/65
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to