davtex opened a new issue, #3905:
URL: https://github.com/apache/bookkeeper/issues/3905
**BUG REPORT**
To be completely honest I am not sure if this is a bug or I just seriously
misunderstand/misconfigure something.
- running Pulsar 2.9.3 in Openshift platform in AWS.
- 3 bookies running
- ledgers configured as 3:3:2
We lose one of the bookies (bookie-2), all data is lost. We delete cookie,
reconnect bookie (with same BookieID). Now we have bookie-0 and bookie-1 with
ledger intact, bookie-2 without ledger, and ledger metadata in Zookeeper
pointing to all 3 of them. Autorecovery runs and confirms above state. Now I
would expect autorecovery to replicate missing data from bookie-0/bookie-1 to
bookie-2. But what happens is that autorecovery removes all bookies from list
of available bookies, including bookie-2, which does not hold the ledger, it
just happens to still be listed in ledger's metadata. I do not understand why
this is happening -> I would expect bookie-2 to be considered for replication
after it replies with "NOLEDGER". Only way to fix this issue for me is to start
4th bookie. Is this how this should work? Do I need to keep ensemble lower than
my total number of running bookies simply to have a spare for autorecovery?
***To Reproduce***
Steps to reproduce the behavior:
1. Run 3 bookies (0,1,2)
2. Create test data
```
shell simpletest -ensemble 3 -writeQuorum 3 -ackQuorum 2 -numEntries 100
```
3. Shut down bookie-2 and delete all data from bookie-2
4. Remove cookie of bookie-2
5. Start bookie-2
6. Enable autorecovery, watch it fail with "Not enough non-faulty bookies
available"
```
2023-04-05T16:54:54,050+0000 [BookKeeperClientWorker-OrderedExecutor-0-0]
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response
from
bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
rc:EOK, ledger:74:entry:0:entryLength:136
2023-04-05T16:54:54,051+0000 [BookKeeperClientWorker-OrderedExecutor-0-0]
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response
from
bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
rc:EOK, ledger:74:entry:99:entryLength:136
2023-04-05T16:54:54,052+0000 [BookKeeperClientWorker-OrderedExecutor-0-0]
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response
from
bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
rc:EOK, ledger:74:entry:0:entryLength:136
2023-04-05T16:54:54,053+0000 [BookKeeperClientWorker-OrderedExecutor-0-0]
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response
from
bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
rc:EOK, ledger:74:entry:99:entryLength:136
2023-04-05T16:54:54,055+0000 [BookKeeperClientWorker-OrderedExecutor-0-0]
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response
from
bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
rc:ENOLEDGER, ledger:74:entry:0:entryLength:0
2023-04-05T16:54:54,056+0000 [ReplicationWorker] DEBUG
org.apache.bookkeeper.replication.ReplicationWorker - Founds fragments
[Fragment(LedgerID: 74, FirstEntryID: 0[0], LastKnownEntryID: 99[99], Host:
[prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181],
Closed: true)] for replication from ledger: 74
2023-04-05T16:54:54,056+0000 [ReplicationWorker] DEBUG
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Try to
choose a new bookie to replace
prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
from ensemble
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>],
excluding
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
2023-04-05T16:54:54,057+0000 [BookKeeperClientWorker-OrderedExecutor-0-0]
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response
from
bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
rc:ENOLEDGER, ledger:74:entry:99:entryLength:0
2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to
find 1 bookies : excludeBookies
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>],
allBookies
[<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to
choose a bookie: excluded
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>],
fallback to choose bookie randomly from the cluster.
2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to
find 1 bookies : excludeBookies
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>],
allBookies
[<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
2023-04-05T16:54:54,058+0000 [ReplicationWorker] WARN
org.apache.bookkeeper.replication.ReplicationWorker -
BKNotEnoughBookiesException while replicating the fragment
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not
enough non-faulty bookies available
at
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectRandomInternal(RackawareEnsemblePlacementPolicyImpl.java:781)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectRandom(RackawareEnsemblePlacementPolicyImpl.java:698)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectFromNetworkLocation(RackawareEnsemblePlacementPolicyImpl.java:587)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.selectFromNetworkLocation(RackawareEnsemblePlacementPolicy.java:206)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectFromNetworkLocation(RackawareEnsemblePlacementPolicyImpl.java:547)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.selectFromNetworkLocation(RackawareEnsemblePlacementPolicy.java:227)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.replaceBookie(RackawareEnsemblePlacementPolicyImpl.java:475)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.replaceBookie(RackawareEnsemblePlacementPolicy.java:120)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.BookKeeperAdmin.getReplacementBookiesByIndexes(BookKeeperAdmin.java:1080)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.client.BookKeeperAdmin.replicateLedgerFragment(BookKeeperAdmin.java:1129)
~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:377)
[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:269)
[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:230)
[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
at java.lang.Thread.run(Thread.java:829) [?:?]
2023-04-05T16:54:54,059+0000 [ReplicationWorker] DEBUG
org.apache.bookkeeper.client.LedgerChecker - Checking fragment
Fragment(LedgerID: 74, FirstEntryID: 0[0], LastKnownEntryID: 99[99], Host:
[prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181,
prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181,
prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181],
Closed: true)
2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0]
DEBUG org.apache.bookkeeper.proto.ReadEntryProcessor - Received new read
request: Op(2)[Ledger:74,Entry:0]
2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0]
DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage -
Get Entry: 74@0
2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0]
DEBUG org.apache.bookkeeper.proto.ReadEntryProcessor - ##### Read entry #####
136 -- ref-count: 1
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]