jemc opened a new issue, #2853: URL: https://github.com/apache/kvrocks/issues/2853
### Search before asking - [x] I had searched in the [issues](https://github.com/apache/kvrocks/issues) and found no similar issues. ### Version This is data race bug is present both on the latest release tag (2.11.1) and on the latest unstable branch commit (1639c90f094728fee860c3c0f518350df40ecb90), here: https://github.com/apache/kvrocks/blob/1639c90f094728fee860c3c0f518350df40ecb90/src/server/server.h#L202 ### Minimal reproduce step This bug was discovered just by reading the code. Data races are usually difficult to trigger reliably in practice, so I don't have steps which would reliably trigger misbehavior at runtime. ### What did you expect to see? I expect to see writable fields of `Server` to be guarded by either locks or atomic operations, for both writes and reads, if that field is writable/readable from concurrent threads, to avoid data races. Failing to do so results in undefined behavior, up to and including SIGSEGV. ### What did you see instead? The `Server::IsSlave()` method is implemented by reading the `master_host_` field and checking if it's an empty string: https://github.com/apache/kvrocks/blob/1639c90f094728fee860c3c0f518350df40ecb90/src/server/server.h#L202 This field can be written to at runtime, whenever a new master is assigned (e.g. via the `REPLICAOF`/`SLAVEOF` commands), inside the `Server::AddMaster` method here: https://github.com/apache/kvrocks/blob/a2051946b655d46ffbdfc93b71b8f4d91d9d4868/src/server/server.cc#L305 That method which writes to this field is guarded by the `slaveof_mu_` lock here: https://github.com/apache/kvrocks/blob/a2051946b655d46ffbdfc93b71b8f4d91d9d4868/src/server/server.cc#L278 But the `Server::IsSlave()` method is _not_ guarded by the `slaveof_mu_` lock: https://github.com/apache/kvrocks/blob/1639c90f094728fee860c3c0f518350df40ecb90/src/server/server.h#L202 And it is called from call sites that definitely originate in other threads, and are _not_ guarded by the `slaveof_mu_` lock. One such example (though there are probably more) is inside the `Connection::ExecuteCommands()` method, here: https://github.com/apache/kvrocks/blob/1639c90f094728fee860c3c0f518350df40ecb90/src/server/redis_connection.cc#L488-L491 In that example, if a `REPLICAOF`/`SLAVEOF` command (which writes this field via the `Server::AddMaster()` method) is being processed in parallel (on multiple worker threads) with any other command (which reads this field via the `Server::IsSlave()` method that is checked before executing), then we will have a write colliding with a parallel read to the same memory - a classic example of a data race. Undefined behavior will result, up to and including a potential `SIGSEGV`. ### Anything Else? Note that introducing a guard with the `slaveof_mu_` lock inside `Server::IsSlave` would solve this issue, but it will also have a performance cost. It might be worth coming up with a design based on atomic operations instead, though that will require more subtle thinking and will still not be zero-overhead. ### Are you willing to submit a PR? - [ ] I'm willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
