Todd Lipcon created KUDU-1436:
---------------------------------
Summary: Concurrent remote bootstrap calls from same server can
crash or result in corrupt replicas
Key: KUDU-1436
URL: https://issues.apache.org/jira/browse/KUDU-1436
Project: Kudu
Issue Type: Bug
Components: consensus, tserver
Affects Versions: 0.8.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Blocker
If a BeginRemoteBootstrapSession call times out, the client may send a second
call, and both calls may end up being processed. If the second call (C2) gets
processed before the first (C1), the following race can occur:
- C2: initializing remote bootstrap session
- C1: waiting on lock
- C2: finishes initializing, drops lock, and starts to copy the tablet metadata
out of the session object (outside of any lock)
- C1: acquires lock and follows the "Re-initializing" code path. This code path
calls Clear() on its snapshots of the tablet metadata.
- C2: may crash, or may copy an incomplete version of the metadata (e.g. with
missing fields)
- C2 responds to the client
This can cause a number of issues:
- If C2 ends up getting a partially-initialized metadata protobuf, we can
trigger a crash in RPC (we don't handle sending responses that have missing
required fields)
- C2 might actually get a fully correct response back to the client. But in
the meantime, C1 has managed to un-anchor and re-anchor logs and blocks. This
means that C2 will eventually copy log entries that are newer than its
metadata snapshot, which can trigger an assertion like KUDU-1046.
Basically all bets are off when this race is triggered.
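
As a rough illustration of the race described above, here is a minimal C++
sketch of one way to make a duplicate Begin call harmless. It is not Kudu's
actual code or the eventual fix: Session, Metadata, InitOnce(), Snapshot() and
DuplicateBeginIsSafe() are all hypothetical names. It assumes two things:
initialization is made idempotent, so a retried call never takes a
"Re-initializing" path that clears the snapshot, and the metadata is copied out
while still holding the session lock.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the tablet metadata protobuf held by a session.
struct Metadata {
  std::string tablet_id;
  std::vector<int> block_ids;
  bool Complete() const { return !tablet_id.empty() && block_ids.size() == 3; }
};

// Hypothetical session object; this is not Kudu's remote bootstrap session.
class Session {
 public:
  // Idempotent: a duplicate Begin call finds the session already initialized
  // and leaves the existing snapshot untouched instead of clearing it.
  void InitOnce() {
    std::lock_guard<std::mutex> l(lock_);
    if (initialized_) return;
    meta_.tablet_id = "tablet-1";
    meta_.block_ids = {1, 2, 3};
    initialized_ = true;
  }

  // Copies the metadata while holding the lock, so no other call can
  // modify the snapshot in the middle of the copy.
  Metadata Snapshot() const {
    std::lock_guard<std::mutex> l(lock_);
    return meta_;
  }

 private:
  mutable std::mutex lock_;
  bool initialized_ = false;
  Metadata meta_;
};

// Simulates the original call (C1) and its retry (C2) hitting one session.
bool DuplicateBeginIsSafe() {
  Session session;
  Metadata out[2];
  auto handler = [&session, &out](int i) {
    session.InitOnce();
    out[i] = session.Snapshot();
  };
  std::thread c1(handler, 0);
  std::thread c2(handler, 1);
  c1.join();
  c2.join();
  // Both calls must observe the same snapshot, whichever ran first.
  assert(out[0].tablet_id == out[1].tablet_id);
  return out[0].Complete() && out[1].Complete();
}
```

Whichever thread acquires the lock first performs the initialization; the other
sees initialized_ set and only copies. A real fix would also have to keep the
log and block anchors stable across the duplicate call, which this sketch does
not model.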