[
https://issues.apache.org/jira/browse/KUDU-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989321#comment-16989321
]
Andrew Wong commented on KUDU-3013:
-----------------------------------
Sorry this slipped through my notifications. [~helifu] thanks for your
analysis! I think this may be a real issue since disk-failure-handling code
will call {{Tablet::Stop()}}, not {{TabletReplica::Stop()}} (see
{{TabletReplica::MakeUnavailable()}}, which gets called by
{{TSTabletManager::FailTabletAndScheduleShutdown()}}).
{quote}
Another option is to call Mvcc::Close before tinkering with the tablet's state,
so that MvccManager::AbortTransaction does the right thing when MVCC is closed.
From a quick glance over the MVCC code I think in-flight transactions would
handle this gracefully, but I'm not sure.
{quote}
I _would_ like this approach, but I'm not sure that it's safe. Op failures will
generally lead to checks against the tablet's state, not the MVCC state. That
would mean we could end up aborting a transaction due to MVCC being closed, but
because we haven't locked or changed the tablet's state, callers of
{{apply}} would see a failure while the tablet isn't closed. So, while it's
somewhat nerve-wracking, I would consider calling {{Mvcc::Close()}} under the
tablet's lock. I'll ponder this some more and try putting up a fix.
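
To make the ordering concern concrete, here is a minimal single-threaded sketch. It is a toy model, not Kudu's actual code: {{ToyMvcc}}, {{ToyTablet}}, {{Outcome}}, and the two helper functions are hypothetical names, and the CHECK is modeled as a return value instead of a crash. It replays the reported interleaving sequentially, then the variant where MVCC is closed under the tablet's state lock.

```cpp
#include <cassert>
#include <mutex>

// Toy model only: names echo the discussion but are NOT the real Kudu classes.
enum class TxnState { RESERVED, APPLYING };
enum class Outcome { kApplied, kAborted, kAbortIgnoredMvccClosed, kCheckWouldFail };

class ToyMvcc {
 public:
  void Close() {
    std::lock_guard<std::mutex> l(lock_);
    open_ = false;
  }

  // Mirrors the shape of the abort path in the trace: a closed MVCC ignores
  // the transaction's state, while an open one expects RESERVED.
  Outcome AbortTransaction(TxnState state) {
    std::lock_guard<std::mutex> l(lock_);
    if (!open_) return Outcome::kAbortIgnoredMvccClosed;
    if (state != TxnState::RESERVED) return Outcome::kCheckWouldFail;  // stands in for CHECK_EQ
    return Outcome::kAborted;
  }

 private:
  std::mutex lock_;
  bool open_ = true;
};

class ToyTablet {
 public:
  // Ordering from the report: the state flips first, MVCC closes in a later step.
  void MarkStopped() {
    std::lock_guard<std::mutex> l(state_lock_);
    stopped_ = true;
  }
  void CloseMvcc() { mvcc_.Close(); }

  // Ordering under consideration: close MVCC while holding the state lock,
  // so no op can observe "stopped" together with "MVCC still open".
  void StopUnderLock() {
    std::lock_guard<std::mutex> l(state_lock_);
    stopped_ = true;
    mvcc_.Close();
  }

  // An op that has already moved RESERVED -> APPLYING checks the tablet state
  // and, if the tablet is stopped, aborts its MVCC transaction.
  Outcome ApplyOp() {
    bool stopped;
    {
      std::lock_guard<std::mutex> l(state_lock_);
      stopped = stopped_;
    }
    if (!stopped) return Outcome::kApplied;
    return mvcc_.AbortTransaction(TxnState::APPLYING);
  }

 private:
  std::mutex state_lock_;
  bool stopped_ = false;
  ToyMvcc mvcc_;
};

// Replays the reported interleaving sequentially.
Outcome ReportedInterleaving() {
  ToyTablet t;
  t.MarkStopped();          // set_state_unlocked(kStopped)
  Outcome o = t.ApplyOp();  // op fails the state check and aborts
  t.CloseMvcc();            // mvcc_.Close() arrives too late
  return o;
}

// Same op, but the stop path closes MVCC under the state lock.
Outcome StopUnderLockThenApply() {
  ToyTablet t;
  t.StopUnderLock();
  return t.ApplyOp();
}
```

In this model, {{ReportedInterleaving()}} returns {{Outcome::kCheckWouldFail}}, while {{StopUnderLockThenApply()}} returns {{Outcome::kAbortIgnoredMvccClosed}}, because the state change and the MVCC close become visible to the op together.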
> Race in StopTabletITest.TestStoppedTabletsDontWrite
> ---------------------------------------------------
>
> Key: KUDU-3013
> URL: https://issues.apache.org/jira/browse/KUDU-3013
> Project: Kudu
> Issue Type: Bug
> Reporter: LiFu He
> Priority: Major
> Attachments:
> jenkins-slave.1575252039.26703.311237e4f4a39e5fea3b175fbf12d3e4aa8674dc.81.0-artifacts.zip
>
>
> I hit this issue on Jenkins this morning, and it seems there is a race in
> StopTabletITest.TestStoppedTabletsDontWrite.
> {code:java}
> TransactionDriver::ApplyTask()                      Tablet::Stop()
> |                                                   |
> transaction_->Apply()                               |
> |                                                   |
> tablet->ApplyRowOperations(state())                 |
> (RESERVED -> APPLYING)                              |
> |                                                   |
> StartApplying(tx_state);                            |
> |
>                                                     set_state_unlocked(kStopped);
> ApplyRowOperation()                                 |
> |                                                   |
> CheckHasNotBeenStoppedUnlocked()                    |
> (return error since the tablet has been stopped)    |
> |                                                   |
> HandleFailure(s)                                    |
> |                                                   |
> transaction_->Finish(Transaction::ABORTED);         |
> |                                                   |
> state()->CommitOrAbort(result);                     |
> |                                                   |
> ReleaseMvccTxn(result);                             |
> |                                                   |
> mvcc_tx_->Abort();                                  |
> |                                                   |
> manager_->AbortTransaction(timestamp_);             |
> |                                                   |
> if (PREDICT_FALSE(!is_open()))                      |
> |                                                   mvcc_.Close();
> |                                                   |
> |                                                   open_.store(false);
> CHECK_EQ(old_state, RESERVED)                       |
> (ASSERT failed)
> {code}