[
https://issues.apache.org/jira/browse/KUDU-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated KUDU-1436:
------------------------------
Code Review: http://gerrit.cloudera.org:8080/#/c/2913/
> Concurrent remote bootstrap calls of same tablet from same server can crash
> or result in corrupt replicas
> ---------------------------------------------------------------------------------------------------------
>
> Key: KUDU-1436
> URL: https://issues.apache.org/jira/browse/KUDU-1436
> Project: Kudu
> Issue Type: Bug
> Components: consensus, tserver
> Affects Versions: 0.8.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Blocker
>
> If a BeginRemoteBootstrapSession call times out, the client may send a
> second call, and both calls can end up being processed. If the second call
> (C2) is processed before the first (C1), this triggers the following race:
> - C2: initializes the remote bootstrap session
> - C1: waits on the lock
> - C2: finishes initializing, drops the lock, and starts to copy the tablet
> metadata out of the session object (outside of any lock)
> - C1: acquires the lock and follows the "Re-initializing" code path. This code
> path calls Clear() on its snapshots of the tablet metadata.
> - C2: may crash or end up with an incomplete copy of the metadata (e.g. with
> missing fields)
> - C2: responds to the client
> This can cause a number of issues:
> - If C2 ends up with a partially-initialized metadata protobuf, we can
> trigger a crash in RPC (we don't handle sending responses that have missing
> required fields).
> - C2 might actually get a fully correct response back to the client. But in
> the meantime, C1 has managed to un-anchor and re-anchor logs and blocks. This
> means that C2 will eventually copy log entries that are newer than its
> metadata snapshot, which can trigger an assertion like KUDU-1046.
> Basically, all bets are off when this race is triggered.
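
One way to picture the partial-copy half of the race described above is the sketch below. It is not Kudu's code and not necessarily the fix under review: BootstrapSession, MetadataSnapshot, InitAndCopy and RunConcurrentCalls are hypothetical names, and the snapshot is a plain struct standing in for the metadata protobuf. The only point it illustrates is that the copy of the snapshot is taken while the lock is still held, so a concurrent re-initialization cannot clear the snapshot partway through the copy.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the tablet metadata protobuf.
struct MetadataSnapshot {
  std::string tablet_id;
  std::vector<int> block_ids;
  // "Complete" here means every field this sketch populates is present.
  bool complete() const { return !tablet_id.empty() && block_ids.size() == 3; }
};

// Hypothetical session object (an assumption for illustration only).
class BootstrapSession {
 public:
  // (Re-)initializes the session and returns a copy of the snapshot.
  // The copy is constructed before the guard is released, so another
  // caller cannot reset the snapshot in the middle of the copy.
  MetadataSnapshot InitAndCopy(const std::string& tablet_id) {
    std::lock_guard<std::mutex> guard(lock_);
    snapshot_ = MetadataSnapshot();  // plays the role of Clear()
    snapshot_.tablet_id = tablet_id;
    snapshot_.block_ids = {1, 2, 3};
    return snapshot_;  // copied while lock_ is still held
  }

 private:
  std::mutex lock_;
  MetadataSnapshot snapshot_;
};

// Issues the same call from two threads, as a timed-out-and-retried
// request would, and reports whether every returned snapshot was complete.
bool RunConcurrentCalls(int rounds) {
  BootstrapSession session;
  bool ok1 = true;
  bool ok2 = true;
  auto call = [&session, rounds](bool* ok) {
    for (int i = 0; i < rounds; ++i) {
      if (!session.InitAndCopy("tablet-1").complete()) *ok = false;
    }
  };
  std::thread c1(call, &ok1);
  std::thread c2(call, &ok2);
  c1.join();
  c2.join();
  return ok1 && ok2;
}
```

Calling RunConcurrentCalls(1000) exercises both callers against one session. Note that this sketch only covers the incomplete-copy symptom. The un-anchoring and re-anchoring of logs and blocks by the second initialization is a separate problem that copying under the lock does not address.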