poorbarcode opened a new pull request, #19444:
URL: https://github.com/apache/pulsar/pull/19444
### Motivation
Unstable ZK connections can cause these conditions to occur:
1. add entry timeout(because the connections of BK to ZK is also unstable)
1. write ZK fail when ledger info update
1. delete ledger is delayed(because delete meta is delayed)
1. and so on
When cases `1,2,3` occur at the same time, a ledger that is in use may be
deleted. for example:
- add entry(`3:0`) timeout(eventually successful)
- the field `lastConfirmedPosition` of the current ledger is still `-1`
- the ledger info of ML is `[3]` now.
- switch ledger caused by add entry fail, but write ZK is also failure.
- delete ledger `3` async (because `lastConfirmedPosition == -1`), and the
operation is delayed.
- topic reload caused by lost ZK connection, close old ML and create new
- execute `ml.recover`
- ledger `3` is not removed during recovery because it contains data
- create new ledger `4`
- the ledger info of ML is `[3,4]` now
- the operation of `delete ledger 3` finished.
- an error occurs: the deleted ledger `3` is still in use, and the topic
load will fail by the ledger is lost.
You can reproduce this issue by `testLedgerInfoMetaCorrectIfAddEntryTimeOut`.
----
Below is a log of when an error occurs in our online environment:
##### 1. Topic load is fail
```
2022-12-02T22:33:44.115106032-05:00 stdout F 2022-12-03T03:33:44,115+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN
org.apache.bookkeeper.mledger.impl.OpReadEntry -
[xes/etl/persistent/stu_order][pulsar.dedup] read failed from ledger at
position:21347:0
2022-12-02T22:33:44.115018833-05:00 stdout F 2022-12-03T03:33:44,114+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
[xes/etl/persistent/stu_order] Error opening ledger for reading at position
21347:0 -
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException:
No such ledger exists on Metadata Server
2022-12-02T22:33:44.454765118-05:00 stdout F 2022-12-03T03:33:44,454+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
[xes/etl/persistent/stu_order] Error opening ledger for reading at position
21347:0 -
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException:
No such ledger exists on Metadata Server
2022-12-02T22:33:44.454865574-05:00 stdout F 2022-12-03T03:33:44,454+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN
org.apache.bookkeeper.mledger.impl.OpReadEntry -
[xes/etl/persistent/stu_order][pulsar.dedup] read failed from ledger at
position:21347:0
2022-12-02T22:33:44.957770974-05:00 stdout F 2022-12-03T03:33:44,957+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
[xes/etl/persistent/stu_order] Error opening ledger for reading at position
21347:0 -
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException:
No such ledger exists on Metadata Server
2022-12-02T22:33:44.957880879-05:00 stdout F 2022-12-03T03:33:44,957+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN
org.apache.bookkeeper.mledger.impl.OpReadEntry -
[xes/etl/persistent/stu_order][pulsar.dedup] read failed from ledger at
position:21347:0
2022-12-02T22:33:45.384780673-05:00 stdout F 2022-12-03T03:33:45,384+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
[xes/etl/persistent/stu_order] Error opening ledger for reading at position
21347:0 -
org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException:
No such ledger exists on Metadata Server
2022-12-02T22:33:45.384874832-05:00 stdout F 2022-12-03T03:33:45,384+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN
org.apache.bookkeeper.mledger.impl.OpReadEntry -
[xes/etl/persistent/stu_order][pulsar.dedup] read failed from ledger at
position:21347:0
```
##### 2. The connection of ZK is unstable
```
# pulsar-broker-4
2022-12-02T22:23:42.659683926-05:00 stdout F 2022-12-03T03:23:42,659+0000
[metadata-store-zk-session-watcher-11-1] INFO
org.apache.pulsar.broker.PulsarService - Received metadata service session
event: ConnectionLost
# pulsar-broker-4
2022-12-02T22:23:42.660145113-05:00 stdout F 2022-12-03T03:23:42,659+0000
[metadata-store-zk-session-watcher-11-1] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received
MetadataStore session event: ConnectionLost
# pulsar-broker-0
2022-12-02T22:23:48.738214552-05:00 stdout F 2022-12-03T03:23:48,738+0000
[metadata-store-zk-session-watcher-11-1] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received
MetadataStore session event: ConnectionLost
# pulsar-broker-0
2022-12-02T22:23:48.73816115-05:00 stdout F 2022-12-03T03:23:48,738+0000
[metadata-store-zk-session-watcher-11-1] INFO
org.apache.pulsar.broker.PulsarService - Received metadata service session
event: ConnectionLost
# pulsar-broker-4
2022-12-02T22:24:08.660979585-05:00 stdout F 2022-12-03T03:24:08,660+0000
[metadata-store-zk-session-watcher-11-1] INFO
org.apache.pulsar.broker.PulsarService - Received metadata service session
event: SessionLost
# pulsar-broker-4
2022-12-02T22:24:08.660989559-05:00 stdout F 2022-12-03T03:24:08,660+0000
[metadata-store-zk-session-watcher-11-1] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received
MetadataStore session event: SessionLost
# pulsar-broker-0
2022-12-02T22:24:14.703831204-05:00 stdout F 2022-12-03T03:24:14,703+0000
[metadata-store-zk-session-watcher-11-1] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received
MetadataStore session event: SessionLost
# pulsar-broker-0
2022-12-02T22:24:14.7038165-05:00 stdout F 2022-12-03T03:24:14,703+0000
[metadata-store-zk-session-watcher-11-1] INFO
org.apache.pulsar.broker.PulsarService - Received metadata service session
event: SessionLost
# pulsar-broker-4
2022-12-02T22:25:16.667165767-05:00 stdout F 2022-12-03T03:25:16,666+0000
[ZKC-connect-executor-0-EventThread] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received
MetadataStore session event: Reconnected
# pulsar-broker-4
2022-12-02T22:25:16.667042947-05:00 stdout F 2022-12-03T03:25:16,666+0000
[ZKC-connect-executor-0-EventThread] INFO
org.apache.pulsar.broker.PulsarService - Received metadata service session
event: Reconnected
# pulsar-broker-4
2022-12-02T22:25:16.681716421-05:00 stdout F 2022-12-03T03:25:16,681+0000
[ZKC-connect-executor-0-EventThread] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received
MetadataStore session event: SessionReestablished
# pulsar-broker-4
2022-12-02T22:25:16.681441416-05:00 stdout F 2022-12-03T03:25:16,681+0000
[ZKC-connect-executor-0-EventThread] INFO
org.apache.pulsar.broker.PulsarService - Received metadata service session
event: SessionReestablished
# pulsar-broker-4
2022-12-02T22:25:16.761244318-05:00 stdout F 2022-12-03T03:25:16,760+0000
[metadata-store-coordination-service-12-1] WARN
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl - Failure when
processing session event
# pulsar-broker-0
2022-12-02T22:33:24.741380081-05:00 stdout F 2022-12-03T03:33:24,741+0000
[main-EventThread] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received
MetadataStore session event: Reconnected
# pulsar-broker-0
2022-12-02T22:33:24.74137549-05:00 stdout F 2022-12-03T03:33:24,741+0000
[main-EventThread] INFO org.apache.pulsar.broker.PulsarService - Received
metadata service session event: Reconnected
# pulsar-broker-0
2022-12-02T22:33:24.755746638-05:00 stdout F 2022-12-03T03:33:24,754+0000
[main-EventThread] INFO org.apache.pulsar.broker.PulsarService - Received
metadata service session event: SessionReestablished
# pulsar-broker-0
2022-12-02T22:33:24.755772892-05:00 stdout F 2022-12-03T03:33:24,754+0000
[main-EventThread] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received
MetadataStore session event: SessionReestablished
```
##### 3.The Topic switches brokers frequently
```
# pulsar-broker-4
2022-12-02T22:21:01.15839105-05:00 stdout F 2022-12-03T03:21:01,157+0000
[bookkeeper-ml-scheduler-OrderedScheduler-1-0] ERROR
org.apache.pulsar.broker.service.persistent.PersistentTopic -
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding
anyway.
2022-12-02T22:21:01.679585396-05:00 stdout F 2022-12-03T03:21:01,679+0000
[ZKC-connect-executor-0-EventThread] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger
xes/etl/persistent/stu_order
# pulsar-broker-0
2022-12-02T22:21:35.923492915-05:00 stdout F 2022-12-03T03:21:35,923+0000
[bookkeeper-ml-scheduler-OrderedScheduler-1-0] ERROR
org.apache.pulsar.broker.service.persistent.PersistentTopic -
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding
anyway.
2022-12-02T22:21:36.252018422-05:00 stdout F 2022-12-03T03:21:36,251+0000
[pulsar-io-4-3] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
Opening managed ledger xes/etl/persistent/stu_order
# pulsar-broker-4
2022-12-02T22:23:11.159764484-05:00 stdout F 2022-12-03T03:23:11,159+0000
[pulsar-io-4-1] ERROR
org.apache.pulsar.broker.service.persistent.PersistentTopic -
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding
anyway.
2022-12-02T22:23:11.333602486-05:00 stdout F 2022-12-03T03:23:11,333+0000
[pulsar-io-4-1] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
Opening managed ledger xes/etl/persistent/stu_order
# pulsar-broker-0
2022-12-02T22:23:12.962155931-05:00 stdout F 2022-12-03T03:23:12,961+0000
[pulsar-io-4-3] ERROR
org.apache.pulsar.broker.service.persistent.PersistentTopic -
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding
anyway.
2022-12-02T22:23:13.20731387-05:00 stdout F 2022-12-03T03:23:13,207+0000
[pulsar-io-4-3] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
Opening managed ledger xes/etl/persistent/stu_order
# pulsar-broker-4
2022-12-02T22:26:35.030681434-05:00 stdout F 2022-12-03T03:26:35,030+0000
[pulsar-io-4-2] ERROR
org.apache.pulsar.broker.service.persistent.PersistentTopic -
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding
anyway.
2022-12-02T22:27:01.794766318-05:00 stdout F 2022-12-03T03:27:01,794+0000
[pulsar-io-4-2] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
Opening managed ledger xes/etl/persistent/stu_order
# pulsar-broker-0
2022-12-02T22:33:24.867254146-05:00 stdout F 2022-12-03T03:33:24,866+0000
[pulsar-io-4-1] ERROR
org.apache.pulsar.broker.service.persistent.PersistentTopic -
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding
anyway.
2022-12-02T22:33:25.097914184-05:00 stdout F 2022-12-03T03:33:25,097+0000
[main-EventThread] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
Opening managed ledger xes/etl/persistent/stu_order
# pulsar-broker-4
2022-12-02T22:36:01.570683778-05:00 stdout F 2022-12-03T03:36:01,570+0000
[pulsar-io-4-4] ERROR
org.apache.pulsar.broker.service.persistent.PersistentTopic -
[persistent://xes/etl/stu_order] Failed to close managed ledger, proceeding
anyway.
2022-12-02T22:36:01.755647415-05:00 stdout F 2022-12-03T03:36:01,755+0000
[pulsar-io-4-4] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
Opening managed ledger xes/etl/persistent/stu_order
2022-12-02T22:38:47.212421358-05:00 stdout F 2022-12-03T03:38:47,212+0000
[ZKC-connect-executor-0-EventThread] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger
public/default/persistent/__sla
```
##### 4. Ledger-21347's life cycle
```
2022-12-02T22:23:13.250490551-05:00 stdout F 2022-12-03T03:23:13,250+0000
[main-EventThread] INFO org.apache.bookkeeper.client.LedgerCreateOp -
Ensemble: [pulsar-bookie-1.pulsar-bookie.pulsar.svc.cluster.local:3181,
pulsar-bookie-4.pulsar-bookie.pulsar.svc.cluster.local:3181] for ledger: 21347
2022-12-02T22:23:13.250647455-05:00 stdout F 2022-12-03T03:23:13,250+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
[xes/etl/persistent/stu_order] Created ledger 21347
2022-12-02T22:24:14.296700424-05:00 stdout F 2022-12-03T03:24:14,296+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN
org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (21347, 0):
Bookie operation timeout
2022-12-02T22:27:01.888372507-05:00 stdout F 2022-12-03T03:27:01,888+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] INFO
org.apache.bookkeeper.client.ReadOnlyLedgerHandle - Closing recovered ledger
21347 at entry 0
2022-12-02T22:33:24.742114702-05:00 stdout F 2022-12-03T03:33:24,742+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] INFO
org.apache.bookkeeper.client.LedgerHandle - New Ensemble:
[pulsar-bookie-0.pulsar-bookie.pulsar.svc.cluster.local:3181,
pulsar-bookie-4.pulsar-bookie.pulsar.svc.cluster.local:3181] for ledger: 21347
2022-12-02T22:33:24.743599829-05:00 stdout F 2022-12-03T03:33:24,743+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR
org.apache.bookkeeper.client.PendingAddOp - Write of ledger entry to quorum
failed: L21347 E0
2022-12-02T22:33:24.743596141-05:00 stdout F 2022-12-03T03:33:24,743+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR
org.apache.bookkeeper.client.LedgerHandle - Closing ledger 21347 due to
LedgerFencedException: Ledger has been fenced off. Some other client must have
opened it to read
2022-12-02T22:33:24.743589628-05:00 stdout F 2022-12-03T03:33:24,743+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] WARN
org.apache.bookkeeper.client.PendingAddOp - Fencing exception on write: L21347
E0 on pulsar-bookie-0.pulsar-bookie.pulsar.svc.cluster.local:3181
2022-12-02T22:33:24.772910842-05:00 stdout F 2022-12-03T03:33:24,772+0000
[main-EventThread] INFO org.apache.bookkeeper.client.MetadataUpdateLoop -
UpdateLoop(ledgerId=21347,loopId=31b528bd) conflict writing metadata to store,
update local value and try again
2022-12-02T22:33:24.787555553-05:00 stdout F 2022-12-03T03:33:24,787+0000
[main-EventThread] ERROR org.apache.bookkeeper.client.LedgerHandle - Metadata
conflict when closing ledger 21347. Another client may have recovered the
ledger while there were writes outstanding. (local lastEntry:-1 length:0)
(metadata lastEntry:0 length:815)
2022-12-02T22:33:24.788055839-05:00 stdout F 2022-12-03T03:33:24,787+0000
[main-EventThread] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop -
UpdateLoop(ledgerId=21347,loopId=31b528bd) Exception updating
2022-12-02T22:33:24.802360201-05:00 stdout F 2022-12-03T03:33:24,802+0000
[BookKeeperClientWorker-OrderedExecutor-1-0] INFO
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl -
[xes/etl/persistent/stu_order] Delete complete for empty ledger 2134. rc=0
```
### Modifications
Only ledgers removed from the meta of ledger info can be deleted from the BK.
### Documentation
<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
- [ ] `doc` <!-- Your PR contains doc changes. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update
later -->
- [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->
### Matching PR in forked repository
PR in forked repository:
- https://github.com/poorbarcode/pulsar/pull/65
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]