ivankelly commented on a change in pull request #2281: [bookie-ledger-recovery] 
Fix bookie recovery stuck even with enough ack-quorum response
URL: https://github.com/apache/bookkeeper/pull/2281#discussion_r403306604
 
 

 ##########
 File path: 
bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java
 ##########
 @@ -137,8 +140,15 @@ public synchronized void readEntryComplete(final int rc, 
final long ledgerId, fi
         }
 
         if (numResponsesPending == 0 && !completed) {
 
 Review comment:
   @eolivelli @sijie 
   
    This fix is the wrong fix. This problem is the same problem as #2273 , and 
we've also seen internally. The root cause is that quorum coverage is broken. 
5e399df broke it. 5e399df was a change in the write path, so why it touched the 
quorum coverage in the read path is a mystery. dcdd1e88 supplemented the break 
which and added a test case. The test case tests exactly the wrong behaviour 
Pre-5e399df it worked correctly. 
   
   It seems there is confusion as to what the quorum coverage is supposed to do.
   
   The quorum coverage set is there to check if we have received a valid 
response from enough nodes that there cannot exist an entry which has been 
acknowledged to the client which has not touched at least one of these nodes. 
In other words, there cannot exist an entry which formed an ack quorum without 
touching one of the nodes we have talked to.
   
   The quorum coverage set should only have two states for a node, known or 
unknown. The known state means that we know definitely than an entry does or 
does not exist on that node. So, an OK, NoSuchLedger and NoSuchEntry response 
will move the node into a known state. Anything else is an unknown state. No 
response from node is unknown state. Node had an IOException is an unknown 
state.
   
   Take the test supplied with dcdd1e88:
   ```
           RoundRobinDistributionSchedule schedule = new 
RoundRobinDistributionSchedule(
               5, 3, 5);
           DistributionSchedule.QuorumCoverageSet covSet = 
schedule.getCoverageSet();
           covSet.addBookie(0, BKException.Code.NoSuchLedgerExistsException);
           covSet.addBookie(1, BKException.Code.NoSuchEntryException);
           covSet.addBookie(2, BKException.Code.NoSuchLedgerExistsException);
           covSet.addBookie(3, BKException.Code.UNINITIALIZED);
           covSet.addBookie(4, BKException.Code.UNINITIALIZED);
           assertFalse(covSet.checkCovered());
   ```
   For clarify, w is 5, a is 3, e is 5. RoundRobinDistributionSchedule 
parameters are messed up.
   
   This test is simply wrong. The coverage set is covered. There is no way an 
entry can have ack quorum in any write quorum. There is no set of 3 bookies in 
the unknown state.
   
   Anyhow, I have a fix going through CI internally. I'll push up a PR on 
monday.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to