Todd Lipcon has submitted this change and it was merged.
Change subject: KUDU-1477. Pending COMMIT message for failed write operation
can prevent tablet startup
KUDU-1477. Pending COMMIT message for failed write operation can prevent tablet
This fixes a bug seen in a recent YCSB stress test that I ran
in which I was accidentally writing tens of thousands of duplicate
keys per second. After a tablet server restarted, it failed to come
up due to a pending commit which referred to no mutated stores
(e.g. because all of the operations were duplicate key inserts).
This patch tweaks the logic for this safety check: a commit with no
mutated stores trivially has "no active stores". However, that's not
the same as having "only inactive stores" -- the subtlety is in the
difference in behavior when a commit has no stores at all.
The patch adds a new targeted test in tablet_bootstrap-test as well as
a more end-to-end test in ts_recovery-itest. Both reproduced the bug
reliably before this patch.
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <dral...@apache.org>
(cherry picked from commit 6894438a406a635dc8a8f3bd77862294163cc7fb)
Reviewed-by: Todd Lipcon <t...@apache.org>
7 files changed, 148 insertions(+), 23 deletions(-)
Todd Lipcon: Looks good to me, approved
Kudu Jenkins: Verified
To view, visit http://gerrit.cloudera.org:8080/3464
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>