Kaustubh1204 commented on issue #268: URL: https://github.com/apache/kvrocks-controller/issues/268#issuecomment-3892268749
Hi @RiversJin and @git-hulk, I would like to take ownership of this issue.

From my understanding, the core problem lies in the non-atomic update sequence between Kvrocks (data plane) and etcd (control plane). If SetSlot succeeds but UpdateCluster fails, the version in Kvrocks can advance beyond the version stored in etcd, leading to a divergence that may block subsequent controller operations. This creates a potential deadlock scenario where future version increments are rejected because they are no longer strictly greater than the already-applied version.

Reversing the update order (etcd first, then Kvrocks) can mitigate this specific failure mode, but I believe we should also clearly define version authority and consider reconciliation logic to handle cases where Kvrocks is already ahead. Ensuring idempotency and convergence under partial failures will be critical.

My plan is to:

1. Trace the full slot migration flow and version bump logic.
2. Identify all failure points between SetSlot and UpdateCluster.
3. Propose a safe update sequence with proper CAS/version checks.
4. Evaluate whether reconciliation logic is needed when Kvrocks reports a higher version.

I’ll share a more concrete design proposal before implementing changes to ensure alignment. Please let me know if you have any specific constraints or considerations I should keep in mind.
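To make the ordering and reconciliation idea concrete, here is a minimal in-memory sketch in Go. It is not the controller's actual code: `Store`, `Node`, `Reconcile`, and `BumpVersion` are hypothetical names, `Store` merely stands in for etcd and `Node` for a Kvrocks node, and the policy of letting the store adopt a higher node version is only one possible choice of version authority. A real fix would also have to make sure the stored topology matches the adopted version.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrVersionConflict signals that a version check failed.
var ErrVersionConflict = errors.New("version conflict")

// Store is a hypothetical stand-in for the control plane (etcd in the issue).
type Store struct{ Version int64 }

// CompareAndSwap advances the version only if it still equals expected.
func (s *Store) CompareAndSwap(expected, next int64) error {
	if s.Version != expected {
		return ErrVersionConflict
	}
	s.Version = next
	return nil
}

// Node is a hypothetical stand-in for a Kvrocks node (data plane).
type Node struct {
	Version  int64
	FailNext bool // simulates one transient SetSlot failure
}

// SetSlot accepts only a strictly greater version, as described in the issue.
func (n *Node) SetSlot(version int64) error {
	if n.FailNext {
		n.FailNext = false
		return errors.New("node unreachable")
	}
	if version <= n.Version {
		return ErrVersionConflict
	}
	n.Version = version
	return nil
}

// Reconcile converges both sides on the higher version without bumping it.
func Reconcile(s *Store, n *Node) error {
	switch {
	case n.Version > s.Version:
		// Node ahead: SetSlot landed earlier but the store update was lost.
		return s.CompareAndSwap(s.Version, n.Version)
	case n.Version < s.Version:
		// Store ahead: an earlier SetSlot never reached the node.
		return n.SetSlot(s.Version)
	}
	return nil
}

// BumpVersion reconciles first, then updates the store before the node.
func BumpVersion(s *Store, n *Node) (int64, error) {
	if err := Reconcile(s, n); err != nil {
		return 0, err
	}
	next := s.Version + 1
	if err := s.CompareAndSwap(s.Version, next); err != nil {
		return 0, err
	}
	if err := n.SetSlot(next); err != nil {
		// The store is now ahead; the next Reconcile call repairs this.
		return next, err
	}
	return next, nil
}

func main() {
	// Node ahead of store, as after SetSlot succeeded but UpdateCluster failed.
	s, n := &Store{Version: 5}, &Node{Version: 6}
	v, err := BumpVersion(s, n)
	fmt.Println(v, err, s.Version, n.Version) // prints "7 <nil> 7 7"
}
```

Because every call starts with `Reconcile`, a retry after a partial failure converges instead of being rejected, regardless of which side ended up ahead.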
