[ https://issues.apache.org/jira/browse/KUDU-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986277#comment-16986277 ]

Adar Dembo commented on KUDU-3013:
----------------------------------

Thanks for the detailed analysis; this test has been flaky for some time.

A couple thoughts on how we might fix it:
# We could stop the tablet by calling {{TabletReplica::Stop}} instead of {{Tablet::Stop}}. This is a heavier-weight stop operation which actually calls {{Tablet::Shutdown}}, but also waits for all in-flight transactions to finish, so there won't be any danger of a transaction applying concurrently with the tablet stopping.
# We may be able to adjust {{Tablet::Stop}} to be safer. One option is to call {{Mvcc::Close}} with the tablet's state_lock_ held, but I'm always nervous about holding multiple locks if we don't strictly need to (as it increases the chance of a deadlock). Another option is to call {{Mvcc::Close}} before tinkering with the tablet's state, so that {{MvccManager::AbortTransaction}} does the right thing when MVCC is closed. From a quick glance over the MVCC code I _think_ in-flight transactions would handle this gracefully, but I'm not sure.

[~awong] I think you originally added this test (and the associated stopped states); could you take a look at Lifu's analysis and provide your thoughts on whether the race is test-only or real, and how we might address it?

> Race in StopTabletITest.TestStoppedTabletsDontWrite
> ---------------------------------------------------
>
>                 Key: KUDU-3013
>                 URL: https://issues.apache.org/jira/browse/KUDU-3013
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: LiFu He
>            Priority: Major
>         Attachments: jenkins-slave.1575252039.26703.311237e4f4a39e5fea3b175fbf12d3e4aa8674dc.81.0-artifacts.zip
>
>
> I met this issue on Jenkins this morning, and it seems there is a race in
> StopTabletITest.TestStoppedTabletsDontWrite.
> {code:java}
> // code placeholder
> TransactionDriver::ApplyTask()                        Tablet::Stop()
>           |                                                 |
> transaction_->Apply()                                       |
>           |                                                 |
> tablet->ApplyRowOperations(state())                         |
>   (RESERVED -> APPLYING)                                    |
>           |                                                 |
> StartApplying(tx_state);                                    |
>           |
>                                                       set_state_unlocked(kStopped);
> ApplyRowOperation()                                         |
>           |                                                 |
> CheckHasNotBeenStoppedUnlocked()                            |
> (return error since the tablet has been stopped)            |
>           |                                                 |
> HandleFailure(s)                                            |
>           |                                                 |
> transaction_->Finish(Transaction::ABORTED);                 |
>           |                                                 |
> state()->CommitOrAbort(result);                             |
>           |                                                 |
> ReleaseMvccTxn(result);                                     |
>           |                                                 |
> mvcc_tx_->Abort();                                          |
>           |                                                 |
> manager_->AbortTransaction(timestamp_);                     |
>           |                                                 |
> if (PREDICT_FALSE(!is_open()))                              |
>           |                                           mvcc_.Close();
>           |                                                 |
>           |                                           open_.store(false);
> CHECK_EQ(old_state, RESERVED)                               |
>   (ASSERT failed)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
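
To make the interleaving in the report easier to reason about, here is a rough single-threaded replay of it, together with the "close MVCC before touching the tablet state" ordering suggested in the comment. It is only a sketch: {{ToyMvcc}}, {{AbortOutcome}} and {{ReplayStop}} are hypothetical names and not Kudu's real classes, and the early return when MVCC is closed is inferred from the {{is_open()}} check in the diagram, not copied from the Kudu source. The failing {{CHECK_EQ}} is modelled as a return value so that the replay does not crash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-ins for illustration; not Kudu's real types.
enum class TxnState { RESERVED, APPLYING };

enum class AbortOutcome {
  kAborted,            // transaction was RESERVED and MVCC open: normal abort
  kIgnoredMvccClosed,  // MVCC already closed: transaction state is ignored
  kCheckWouldFail      // MVCC open but state is APPLYING: CHECK_EQ would fire
};

class ToyMvcc {
 public:
  void StartTransaction(uint64_t ts) { in_flight_[ts] = TxnState::RESERVED; }
  void StartApplying(uint64_t ts) { in_flight_.at(ts) = TxnState::APPLYING; }
  void Close() { open_ = false; }
  bool is_open() const { return open_; }

  // Assumed behaviour, per the diagram: tolerate any state once closed,
  // otherwise require that the transaction is still RESERVED.
  AbortOutcome AbortTransaction(uint64_t ts) {
    const TxnState old_state = in_flight_.at(ts);
    in_flight_.erase(ts);
    if (!is_open()) return AbortOutcome::kIgnoredMvccClosed;
    if (old_state != TxnState::RESERVED) return AbortOutcome::kCheckWouldFail;
    return AbortOutcome::kAborted;
  }

 private:
  bool open_ = true;
  std::map<uint64_t, TxnState> in_flight_;
};

// Replays the two "threads" in one fixed order. With close_mvcc_first == false
// this is the order from the report; with true, MVCC is closed before the
// tablet is marked stopped.
AbortOutcome ReplayStop(bool close_mvcc_first) {
  ToyMvcc mvcc;
  const uint64_t ts = 1;

  // Apply thread: the transaction moves RESERVED -> APPLYING.
  mvcc.StartTransaction(ts);
  mvcc.StartApplying(ts);

  // Stop thread: optionally close MVCC first, then mark the tablet stopped.
  if (close_mvcc_first) mvcc.Close();
  // (tablet state becomes kStopped here)

  // Apply thread: the row operation sees the stopped tablet, fails, and the
  // failure path aborts the MVCC transaction.
  const AbortOutcome outcome = mvcc.AbortTransaction(ts);

  // Stop thread: in the reported order, MVCC is closed only now, too late.
  if (!close_mvcc_first) mvcc.Close();
  return outcome;
}
```

In this toy model, {{ReplayStop(false)}} ends in {{kCheckWouldFail}}, matching the assertion in the report, while {{ReplayStop(true)}} ends in {{kIgnoredMvccClosed}}. Whether real in-flight transactions tolerate MVCC being closed underneath them is exactly the open question in the comment, and this sketch does not answer it.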