Will Berkeley has submitted this change and it was merged.
( http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown
......................................................................

KUDU-1678: Race during abort of pending operations during raft shutdown

When a tablet replica is shutting down, the following race can occur:

0: The replica receives an ALTER_SCHEMA op adding the column 'foo'.
1: The replica receives a WRITE_OP inserting a row with column 'foo'
   present.
2: The replica starts to abort its pending operations because it is
   shutting down. The ALTER_SCHEMA is aborted.
3: Before the WRITE_OP can be aborted, it replicates.
4: The tablet server crashes as a result, with a message like:

F1023 20:49:18.088703  1409 transaction_driver.cc:382] T 3fe5651e31d8486780d0e480f8748ead P 12e38846d8c648df8a11b9d8da2ad407 S R-NP Ts 6309182492379934720: Cannot cancel transactions that have already replicated: Invalid argument: Client provided column c587 INT32 NOT NULL not present in tablet

The tablet server will crash with this same message every time it tries
to bootstrap. Other tablet servers hosting replicas may crash if they
replicated the bad write.

The root cause is ordering: pending transactions need to be aborted in
reverse order of their index, since later transactions may depend on
earlier ones (here, the WRITE_OP depends on the ALTER_SCHEMA before it).
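
The ordering argument above can be illustrated with a small, self-contained
sketch. This is not the code in pending_rounds.cc: PendingOp, PendingOps,
ReverseAbortOrder and AbortOrderIsSafe are hypothetical names invented for
this example, and the "schema" is just a set of column names. It only shows
why walking the pending ops from highest index to lowest never leaves a write
pending after the ALTER_SCHEMA it depends on has been aborted.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a pending consensus round (not Kudu's real type).
struct PendingOp {
  enum class Kind { kAlterSchema, kWrite };
  Kind kind;
  std::string column;  // column added (ALTER_SCHEMA) or referenced (WRITE_OP)
};

// Pending ops keyed by op index; std::map iterates in ascending index order.
using PendingOps = std::map<int64_t, PendingOp>;

// Returns the op indexes from highest to lowest, i.e. the order in which
// the ops should be aborted.
std::vector<int64_t> ReverseAbortOrder(const PendingOps& pending) {
  std::vector<int64_t> order;
  for (auto it = pending.crbegin(); it != pending.crend(); ++it) {
    order.push_back(it->first);
  }
  return order;
}

// Simulates aborting ops one at a time in `order`. Returns false if, after
// any single abort, a still-pending write references a column that is
// neither in `base_schema` nor added by a still-pending earlier ALTER_SCHEMA.
// That is the window in which the write could replicate and fail.
bool AbortOrderIsSafe(PendingOps pending,
                      const std::set<std::string>& base_schema,
                      const std::vector<int64_t>& order) {
  for (int64_t index : order) {
    pending.erase(index);
    std::set<std::string> schema = base_schema;
    for (const auto& entry : pending) {  // ascending index order
      const PendingOp& op = entry.second;
      if (op.kind == PendingOp::Kind::kAlterSchema) {
        schema.insert(op.column);
      } else if (schema.count(op.column) == 0) {
        return false;  // surviving write depends on an aborted alter
      }
    }
  }
  return true;
}
```

With an ALTER_SCHEMA adding 'foo' at index 10 and a WRITE_OP using 'foo' at
index 11, AbortOrderIsSafe reports the ascending order {10, 11} as unsafe and
the order produced by ReverseAbortOrder, {11, 10}, as safe.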

This bug reproduced about 1% of the time in alter_table-randomized-test
when run in DEBUG mode with 16 stress threads: I saw it 10 times in 1000
runs. With this change, I saw 0 failures in 1000 runs.

Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Reviewed-on: http://gerrit.cloudera.org:8080/11759
Reviewed-by: Alexey Serbin <aser...@cloudera.com>
Tested-by: Will Berkeley <wdberke...@gmail.com>
---
M src/kudu/consensus/pending_rounds.cc
M src/kudu/integration-tests/alter_table-randomized-test.cc
2 files changed, 23 insertions(+), 13 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Will Berkeley: Verified

--
To view, visit http://gerrit.cloudera.org:8080/11759

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>
