ivankelly commented on a change in pull request #2281: [bookie-ledger-recovery]
Fix bookie recovery stuck even with enough ack-quorum response
URL: https://github.com/apache/bookkeeper/pull/2281#discussion_r403306604
##########
File path:
bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java
##########
@@ -137,8 +140,15 @@ public synchronized void readEntryComplete(final int rc,
final long ledgerId, fi
}
if (numResponsesPending == 0 && !completed) {
Review comment:
@eolivelli @sijie
This fix is the wrong fix. This problem is the same problem as #2273 , and
we've also seen internally. The root cause is that quorum coverage is broken.
5e399df broke it. 5e399df was a change in the write path, so why it touched the
quorum coverage in the read path is a mystery. dcdd1e88 supplemented the break
which and added a test case. The test case tests exactly the wrong behaviour
Pre-5e399df it worked correctly.
It seems there is confusion as to what the quorum coverage is supposed to do.
The quorum coverage set is there to check if we have received a valid
response from enough nodes that there cannot exist an entry which has been
acknowledged to the client which has not touched at least one of these nodes.
In other words, there cannot exist an entry which formed an ack quorum without
touching one of the nodes we have talked to.
The quorum coverage set should only have two states for a node, known or
unknown. The known state means that we know definitely than an entry does or
does not exist on that node. So, an OK, NoSuchLedger and NoSuchEntry response
will move the node into a known state. Anything else is an unknown state. No
response from node is unknown state. Node had an IOException is an unknown
state.
Take the test supplied with dcdd1e88:
```
RoundRobinDistributionSchedule schedule = new
RoundRobinDistributionSchedule(
5, 3, 5);
DistributionSchedule.QuorumCoverageSet covSet =
schedule.getCoverageSet();
covSet.addBookie(0, BKException.Code.NoSuchLedgerExistsException);
covSet.addBookie(1, BKException.Code.NoSuchEntryException);
covSet.addBookie(2, BKException.Code.NoSuchLedgerExistsException);
covSet.addBookie(3, BKException.Code.UNINITIALIZED);
covSet.addBookie(4, BKException.Code.UNINITIALIZED);
assertFalse(covSet.checkCovered());
```
For clarify, w is 5, a is 3, e is 5. RoundRobinDistributionSchedule
parameters are messed up.
This test is simply wrong. The coverage set is covered. There is no way an
entry can have ack quorum in any write quorum. There is no set of 3 bookies in
the unknown state.
Anyhow, I have a fix going through CI internally. I'll push up a PR on
monday.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services