davtex opened a new issue, #3905:
URL: https://github.com/apache/bookkeeper/issues/3905

   **BUG REPORT**
   
   To be completely honest I am not sure if this is a bug or I just seriously 
misunderstand/misconfigure something.
   
   - running Pulsar 2.9.3 in Openshift platform in AWS.
   - 3 bookies running
   - ledgers configured as 3:3:2
   
   We lose one of the bookies (bookie-2), all data is lost. We delete cookie, 
reconnect bookie (with same BookieID). Now we have bookie-0 and bookie-1 with 
ledger intact, bookie-2 without ledger, and ledger metadata in Zookeeper 
pointing to all 3 of them. Autorecovery runs and confirms above state. Now I 
would expect autorecovery to replicate missing data from bookie-0/bookie-1 to 
bookie-2. But what happens is that autorecovery removes all bookies from list 
of available bookies, including bookie-2, which does not hold the ledger, it 
just happens to still be listed in ledger's metadata. I do not understand why 
this is happening -> I would expect bookie-2 to be considered for replication 
after it replies with "NOLEDGER". Only way to fix this issue for me is to start 
4th bookie. Is this how this should work? Do I need to keep ensemble lower than 
my total number of running bookies simply to have a spare for autorecovery?
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   1. Run 3 bookies (0,1,2)
   2. Create test data
   ```
   shell simpletest -ensemble 3 -writeQuorum 3 -ackQuorum 2 -numEntries 100
   ```
   3. Shut down bookie-2 and delete all data from bookie-2
   4. Remove cookie of bookie-2
   5. Start bookie-2
   6. Enable autorecovery, watch it fail with "Not enough non-faulty bookies 
available"
   ```
   2023-04-05T16:54:54,050+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] 
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response 
from 
bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
 rc:EOK, ledger:74:entry:0:entryLength:136
   2023-04-05T16:54:54,051+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] 
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response 
from 
bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
 rc:EOK, ledger:74:entry:99:entryLength:136
   2023-04-05T16:54:54,052+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] 
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response 
from 
bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
 rc:EOK, ledger:74:entry:0:entryLength:136
   2023-04-05T16:54:54,053+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] 
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response 
from 
bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
 rc:EOK, ledger:74:entry:99:entryLength:136
   2023-04-05T16:54:54,055+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] 
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response 
from 
bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
 rc:ENOLEDGER, ledger:74:entry:0:entryLength:0
   2023-04-05T16:54:54,056+0000 [ReplicationWorker] DEBUG 
org.apache.bookkeeper.replication.ReplicationWorker - Founds fragments 
[Fragment(LedgerID: 74, FirstEntryID: 0[0], LastKnownEntryID: 99[99], Host: 
[prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181],
 Closed: true)] for replication from ledger: 74
   2023-04-05T16:54:54,056+0000 [ReplicationWorker] DEBUG 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Try to 
choose a new bookie to replace 
prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
 from ensemble 
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>],
 excluding 
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
   2023-04-05T16:54:54,057+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] 
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response 
from 
bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181
 rc:ENOLEDGER, ledger:74:entry:99:entryLength:0
   2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN  
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to 
find 1 bookies : excludeBookies 
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>],
 allBookies 
[<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
   2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN  
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to 
choose a bookie: excluded 
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>],
 fallback to choose bookie randomly from the cluster.
   2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN  
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to 
find 1 bookies : excludeBookies 
[<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>],
 allBookies 
[<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>,
 
<Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
   2023-04-05T16:54:54,058+0000 [ReplicationWorker] WARN  
org.apache.bookkeeper.replication.ReplicationWorker - 
BKNotEnoughBookiesException while replicating the fragment
   org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not 
enough non-faulty bookies available
           at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectRandomInternal(RackawareEnsemblePlacementPolicyImpl.java:781)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectRandom(RackawareEnsemblePlacementPolicyImpl.java:698)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectFromNetworkLocation(RackawareEnsemblePlacementPolicyImpl.java:587)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.selectFromNetworkLocation(RackawareEnsemblePlacementPolicy.java:206)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectFromNetworkLocation(RackawareEnsemblePlacementPolicyImpl.java:547)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.selectFromNetworkLocation(RackawareEnsemblePlacementPolicy.java:227)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.replaceBookie(RackawareEnsemblePlacementPolicyImpl.java:475)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.replaceBookie(RackawareEnsemblePlacementPolicy.java:120)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.BookKeeperAdmin.getReplacementBookiesByIndexes(BookKeeperAdmin.java:1080)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.client.BookKeeperAdmin.replicateLedgerFragment(BookKeeperAdmin.java:1129)
 ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:377)
 [org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:269)
 [org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:230)
 [org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
           at java.lang.Thread.run(Thread.java:829) [?:?]
   2023-04-05T16:54:54,059+0000 [ReplicationWorker] DEBUG 
org.apache.bookkeeper.client.LedgerChecker - Checking fragment 
Fragment(LedgerID: 74, FirstEntryID: 0[0], LastKnownEntryID: 99[99], Host: 
[prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181,
 
prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181,
 
prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181],
 Closed: true)
   2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0] 
DEBUG org.apache.bookkeeper.proto.ReadEntryProcessor - Received new read 
request: Op(2)[Ledger:74,Entry:0]
   2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0] 
DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - 
Get Entry: 74@0
   2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0] 
DEBUG org.apache.bookkeeper.proto.ReadEntryProcessor - ##### Read entry ##### 
136 -- ref-count: 1
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to