[
https://issues.apache.org/jira/browse/KUDU-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986277#comment-16986277
]
Adar Dembo commented on KUDU-3013:
----------------------------------
Thanks for the detailed analysis; this test has been flaky for some time. A
couple of thoughts on how we might fix it:
# We could stop the tablet by calling {{TabletReplica::Stop}} instead of
{{Tablet::Stop}}. This is a heavier-weight stop operation which actually calls
{{Tablet::Shutdown}}, but also waits for all in-flight transactions to finish,
so there won't be any danger of a transaction applying concurrently with the
tablet stopping.
# We may be able to adjust {{Tablet::Stop}} to be safer. One option is to call
{{MvccManager::Close}} with the tablet's {{state_lock_}} held, but I'm always nervous
about holding multiple locks if we don't strictly need to (as it increases the
chance of a deadlock). Another option is to call {{MvccManager::Close}} before
tinkering with the tablet's state, so that {{MvccManager::AbortTransaction}}
does the right thing when MVCC is closed. From a quick glance over the MVCC
code I _think_ in-flight transactions would handle this gracefully, but I'm not
sure.
[~awong] I think you originally added this test (and the associated stopped
states); could you take a look at Lifu's analysis and provide your thoughts on
whether the race is test-only or real, and how we might address it?
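To make the ordering argument in the second option concrete, here is a minimal single-threaded sketch that replays the interleaving from the issue description. It is a toy model under stated assumptions: {{ToyMvcc}}, {{ToyTablet}}, {{AbortResult}} and {{ReplayAbort}} are hypothetical names invented for illustration, not Kudu's classes, and the {{CHECK_EQ}} is modeled as a return value rather than a crash. It only shows why closing MVCC before flipping the tablet state closes this particular window; it says nothing about whether real in-flight transactions tolerate a closed MVCC.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Toy stand-ins for illustration only; these are NOT Kudu's real classes.
enum class TxnState { kReserved, kApplying };
enum class AbortResult { kAborted, kIgnoredBecauseClosed, kCheckWouldFail };

class ToyMvcc {
 public:
  void Close() { open_.store(false); }
  bool is_open() const { return open_.load(); }

  // Mirrors the shape of the check in the trace: a closed manager ignores the
  // transaction's state, while an open one insists on RESERVED.
  // kCheckWouldFail marks the spot where the trace hits CHECK_EQ.
  AbortResult AbortTransaction(TxnState state) const {
    if (!is_open()) return AbortResult::kIgnoredBecauseClosed;
    if (state != TxnState::kReserved) return AbortResult::kCheckWouldFail;
    return AbortResult::kAborted;
  }

 private:
  std::atomic<bool> open_{true};
};

class ToyTablet {
 public:
  void MarkStopped() {
    std::lock_guard<std::mutex> l(state_lock_);
    stopped_ = true;
  }
  bool IsStopped() const {
    std::lock_guard<std::mutex> l(state_lock_);
    return stopped_;
  }
  ToyMvcc& mvcc() { return mvcc_; }

 private:
  mutable std::mutex state_lock_;
  bool stopped_ = false;
  ToyMvcc mvcc_;
};

// Replays the interleaving deterministically. The transaction has already
// moved RESERVED -> APPLYING when the stop begins.
AbortResult ReplayAbort(bool close_mvcc_first) {
  ToyTablet tablet;
  const TxnState txn = TxnState::kApplying;

  if (close_mvcc_first) tablet.mvcc().Close();  // proposed ordering
  tablet.MarkStopped();                         // state flips to stopped

  // The apply path now sees the stopped tablet, fails, and aborts.
  AbortResult result = AbortResult::kAborted;
  if (tablet.IsStopped()) {
    result = tablet.mvcc().AbortTransaction(txn);
  }

  // Ordering from the trace: MVCC closes only after the abort has run.
  if (!close_mvcc_first) tablet.mvcc().Close();
  return result;
}
```

In this model {{ReplayAbort(false)}} lands on the failing check, while {{ReplayAbort(true)}} takes the "MVCC is closed" early return. The first option (waiting for in-flight transactions before stopping) avoids the window differently, by ensuring no transaction is in APPLYING when the state changes.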
> Race in StopTabletITest.TestStoppedTabletsDontWrite
> ---------------------------------------------------
>
> Key: KUDU-3013
> URL: https://issues.apache.org/jira/browse/KUDU-3013
> Project: Kudu
> Issue Type: Bug
> Reporter: LiFu He
> Priority: Major
> Attachments:
> jenkins-slave.1575252039.26703.311237e4f4a39e5fea3b175fbf12d3e4aa8674dc.81.0-artifacts.zip
>
>
> I met this issue on Jenkins this morning, and it seems there is a race in
> StopTabletITest.TestStoppedTabletsDontWrite.
> {code:java}
> TransactionDriver::ApplyTask()                      Tablet::Stop()
>   |                                                       |
> transaction_->Apply()                                     |
>   |                                                       |
> tablet->ApplyRowOperations(state())                       |
> (RESERVED -> APPLYING)                                    |
>   |                                                       |
> StartApplying(tx_state);                                  |
>                                                           |
>                                                     set_state_unlocked(kStopped);
> ApplyRowOperation()                                       |
>   |                                                       |
> CheckHasNotBeenStoppedUnlocked()                          |
> (return error since the tablet has been stopped)          |
>   |                                                       |
> HandleFailure(s)                                          |
>   |                                                       |
> transaction_->Finish(Transaction::ABORTED);               |
>   |                                                       |
> state()->CommitOrAbort(result);                           |
>   |                                                       |
> ReleaseMvccTxn(result);                                   |
>   |                                                       |
> mvcc_tx_->Abort();                                        |
>   |                                                       |
> manager_->AbortTransaction(timestamp_);                   |
>   |                                                       |
> if (PREDICT_FALSE(!is_open()))                            |
>   |                                                 mvcc_.Close();
>   |                                                       |
>   |                                                 open_.store(false);
> CHECK_EQ(old_state, RESERVED)                             |
> (ASSERT failed)
> {code}