[jira] [Updated] (KUDU-1436) Concurrent remote bootstrap calls of same tablet from same server can crash or result in corrupt replicas

Todd Lipcon (JIRA) Sat, 30 Apr 2016 13:54:52 -0700

     [ https://issues.apache.org/jira/browse/KUDU-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Todd Lipcon updated KUDU-1436:
------------------------------
    Code Review: http://gerrit.cloudera.org:8080/#/c/2913/

> Concurrent remote bootstrap calls of same tablet from same server can crash 
> or result in corrupt replicas
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-1436
>                 URL: https://issues.apache.org/jira/browse/KUDU-1436
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tserver
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>
> If a BeginRemoteBootstrapSession call (C1) times out, the client may send a 
> second call (C2), and both calls get processed. If the second call is 
> processed first, this triggers the following race:
> - C2: initializing remote bootstrap session
> - C1: waiting on lock
> - C2: finishes initializing, drops lock, and starts to copy the tablet 
> metadata out of the session object (outside of any lock)
> - C1: acquires lock and follows the "Re-initializing" code path. This code 
> path calls Clear() on the session's snapshots of the tablet metadata.
> - C2: may crash or copy an incomplete version of the metadata (e.g. with 
> missing fields)
> - C2: responds to the client
> This can cause a number of issues:
> - If C2 ends up getting a partially-initialized metadata protobuf, we can 
> trigger a crash in RPC (we don't handle sending responses that have missing 
> required fields)
> - C2 might actually get a fully correct response back to the client. In the 
> meantime, however, C1 has managed to un-anchor and re-anchor logs and blocks. 
> This means that C2 will eventually copy log entries that are newer than its 
> metadata snapshot, which can trigger an assertion like KUDU-1046
> Basically all bets are off when this race is triggered.
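
The race above comes from reading the session's metadata snapshot after the lock has been dropped. The following is a minimal sketch of one way to close that window, by copying the snapshot while the lock is still held. It is not necessarily the approach taken in the linked code review; BootstrapSession, MetadataSnapshot, InitAndCopy, and CountIncompleteResponses are hypothetical names used only for illustration and are not Kudu's actual classes or functions.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the tablet metadata protobuf held by a session.
struct MetadataSnapshot {
  std::string tablet_id;
  std::vector<int> block_ids;
  // "Complete" here means every field the response needs has been filled in.
  bool IsComplete() const {
    return !tablet_id.empty() && block_ids.size() == 3;
  }
};

// Hypothetical session object (an assumption, not Kudu's real class).
class BootstrapSession {
 public:
  // (Re-)initializes the session and returns a copy of the snapshot that is
  // made while the lock is still held. A concurrent re-initialization can
  // therefore not clear the snapshot while this caller is copying it.
  MetadataSnapshot InitAndCopy(const std::string& tablet_id) {
    std::lock_guard<std::mutex> l(lock_);
    // Mimics the "Re-initializing" path: clear the old state first.
    snapshot_.tablet_id.clear();
    snapshot_.block_ids.clear();
    // Rebuild the snapshot.
    snapshot_.tablet_id = tablet_id;
    for (int i = 0; i < 3; ++i) {
      snapshot_.block_ids.push_back(i);
    }
    // The return value is copied before the lock guard is destroyed.
    return snapshot_;
  }

 private:
  std::mutex lock_;
  MetadataSnapshot snapshot_;
};

// Issues num_calls concurrent calls against one session and counts how many
// callers ended up with an incomplete metadata copy.
int CountIncompleteResponses(int num_calls) {
  BootstrapSession session;
  std::vector<MetadataSnapshot> responses(num_calls);
  std::vector<std::thread> threads;
  for (int i = 0; i < num_calls; ++i) {
    threads.emplace_back([&session, &responses, i] {
      responses[i] = session.InitAndCopy("tablet-1");
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  int incomplete = 0;
  for (const auto& r : responses) {
    if (!r.IsComplete()) {
      ++incomplete;
    }
  }
  return incomplete;
}
```

With this structure, every caller receives a fully populated copy regardless of how the calls interleave. Note that the sketch only addresses the partially-initialized metadata case; it does not address the second issue, where logs and blocks are un-anchored and re-anchored behind a response that has already been sent.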



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
