[ 
https://issues.apache.org/jira/browse/KUDU-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986277#comment-16986277
 ] 

Adar Dembo commented on KUDU-3013:
----------------------------------

Thanks for the detailed analysis; this test has been flaky for some time. A 
couple of thoughts on how we might fix it:
# We could stop the tablet by calling {{TabletReplica::Stop}} instead of 
{{Tablet::Stop}}. This is a heavier-weight stop operation which actually calls 
{{Tablet::Shutdown}}, but also waits for all in-flight transactions to finish, 
so there won't be any danger of a transaction applying concurrently with the 
tablet stopping.
# We may be able to adjust {{Tablet::Stop}} to be safer. One option is to call 
{{Mvcc::Close}} with the tablet's {{state_lock_}} held, but I'm always nervous 
about holding multiple locks if we don't strictly need to (as it increases the 
chance of a deadlock). Another option is to call {{Mvcc::Close}} before 
tinkering with the tablet's state, so that {{MvccManager::AbortTransaction}} 
does the right thing when MVCC is closed. From a quick glance over the MVCC 
code I _think_ in-flight transactions would handle this gracefully, but I'm not 
sure.
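
The last idea (closing MVCC before touching the tablet's state) can be 
illustrated with a toy, single-threaded model of the two orderings. This is 
only a sketch under stated assumptions: {{ToyMvcc}}, {{ToyTablet}}, 
{{AbortOutcome}}, {{StopStateFirst}} and {{CloseMvccFirst}} are hypothetical 
stand-ins and not Kudu's classes, and a returned {{kWouldCrash}} stands in for 
the failed {{CHECK_EQ}} in the trace.

```cpp
#include <atomic>
#include <cassert>

// Toy model only: these types are hypothetical and merely mimic the shape
// of the checks visible in the reported trace.
enum class TxnState { RESERVED, APPLYING };
enum class AbortOutcome { kAbortedCleanly, kIgnoredMvccClosed, kWouldCrash };

class ToyMvcc {
 public:
  void Close() { open_.store(false); }
  bool is_open() const { return open_.load(); }

  // Closed MVCC tolerates any transaction state; open MVCC insists on
  // RESERVED. The real code would fail a CHECK here instead of returning.
  AbortOutcome AbortTransaction(TxnState old_state) const {
    if (!is_open()) return AbortOutcome::kIgnoredMvccClosed;
    if (old_state != TxnState::RESERVED) return AbortOutcome::kWouldCrash;
    return AbortOutcome::kAbortedCleanly;
  }

 private:
  std::atomic<bool> open_{true};
};

struct ToyTablet {
  ToyMvcc mvcc;
  std::atomic<bool> stopped{false};
};

// Ordering from the trace: the state flips to stopped first, and MVCC is
// closed only afterwards. The writer aborts in the gap between the two.
AbortOutcome StopStateFirst() {
  ToyTablet t;
  TxnState txn = TxnState::APPLYING;  // writer is already past StartApplying()
  t.stopped.store(true);              // Stop(), step 1
  assert(t.stopped.load());           // writer sees "stopped" and aborts...
  AbortOutcome out = t.mvcc.AbortTransaction(txn);  // ...while MVCC is open
  t.mvcc.Close();                     // Stop(), step 2: too late
  return out;
}

// Proposed ordering: close MVCC first, so any writer that later observes
// the stopped state also finds MVCC already closed.
AbortOutcome CloseMvccFirst() {
  ToyTablet t;
  TxnState txn = TxnState::APPLYING;
  t.mvcc.Close();                     // Stop(), step 1
  t.stopped.store(true);              // Stop(), step 2
  assert(t.stopped.load());           // writer sees "stopped" and aborts
  return t.mvcc.AbortTransaction(txn);
}
```

In this model {{StopStateFirst()}} yields {{kWouldCrash}} and 
{{CloseMvccFirst()}} yields {{kIgnoredMvccClosed}}. It says nothing about 
whether a transaction that keeps running against an already-closed MVCC 
behaves correctly, which is the open question above.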

[~awong] I think you originally added this test (and the associated stopped 
states); could you take a look at Lifu's analysis and provide your thoughts on 
whether the race is test-only or real, and how we might address it?

> Race in StopTabletITest.TestStoppedTabletsDontWrite
> ---------------------------------------------------
>
>                 Key: KUDU-3013
>                 URL: https://issues.apache.org/jira/browse/KUDU-3013
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: LiFu He
>            Priority: Major
>         Attachments: 
> jenkins-slave.1575252039.26703.311237e4f4a39e5fea3b175fbf12d3e4aa8674dc.81.0-artifacts.zip
>
>
> I ran into this issue on Jenkins this morning, and it seems there is a race in 
> StopTabletITest.TestStoppedTabletsDontWrite.
> {code:java}
> TransactionDriver::ApplyTask()                            Tablet::Stop()
>           |                                                     |
>   transaction_->Apply()                                         |
>           |                                                     |
> tablet->ApplyRowOperations(state())                             |
> (RESERVED -> APPLYING)                                          |
>           |                                                     |
>  StartApplying(tx_state);                                       |
>           |                                          
> set_state_unlocked(kStopped);
>   ApplyRowOperation()                                           |
>           |                                                     |
> CheckHasNotBeenStoppedUnlocked()                                |
> (return error since the tablet has been stopped)                |
>           |                                                     |
>     HandleFailure(s)                                            |
>           |                                                     |
> transaction_->Finish(Transaction::ABORTED);                     |
>           |                                                     |
> state()->CommitOrAbort(result);                                 |
>           |                                                     |
> ReleaseMvccTxn(result);                                         |
>           |                                                     |
> mvcc_tx_->Abort();                                              |
>           |                                                     |
> manager_->AbortTransaction(timestamp_);                         |
>           |                                                     |
> if (PREDICT_FALSE(!is_open()))                                  |
>           |                                                 mvcc_.Close();
>           |                                                     |
>           |                                               open_.store(false);
> CHECK_EQ(old_state, RESERVED)                                   |
>    (ASSERT failed)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
