Hi Jane,

Thank you for the thought-out CEP. I certainly see the use of a feature like this to add resilience during cluster state changes. I have a few questions after reading the CEP.

Driver compatibility: The way I read this, it's based on an ideal scenario where client and server are on the same version to support this feature. In my experience, client rollouts are never complete and often lag far behind the cluster upgrade. What happens when the driver completely ignores GRACEFUL_DISCONNECT? It might mean considering something on the server side.

Discovery things: Speaking of the client, you want to use the SUPPORTED as listed in the v4 spec[1], but why not add this to STARTUP? You mention something in the "Rejected alternatives," but could you expand your thinking here?

Signal multiplication: You have this in the CEP "Other protocols (HTTP/2, PostgreSQL, Redis Cluster) use connection-local in-band signals to enable safe draining." Our protocol guidance[1] explicitly notes that drivers often keep multiple connections and should not register for events on all of them, as this duplicates traffic. I don't know how you could ensure that every connection would be aware of a GRACEFUL_DISCONNECT without changing that aspect of the spec.

Event timing for operators: It's not clear to me when the GRACEFUL_DISCONNECT is emitted when you do something like a drain, disablebinary or just a JVM shutdown hook. This is crucial for operators to understand how this could work and should be in the CEP spec for clarity. I think it will matter to a lot of people.

Operator control: I've been on this push for a while and so I have to mention it. Opt-in vs default. We need more controls in the config YAML.

graceful_disconnect_enabled

If there is a server-side component:

graceful_disconnect_grace_period_ms
graceful_disconnect_max_drain_ms

And finally, it needs more observability... logging/metrics counters: connections_draining, forced_disconnects

Thanks for proposing this!

Patrick

1 - https://cassandra.apache.org/doc/latest/cassandra/_attachments/native_protocol_v4.html

On Tue, Jan 13, 2026 at 4:30 PM Jane H <[email protected]> wrote:

> Hi all,
>
> I’d like to start a discussion on a CEP proposal: *CEP-59: Graceful
> Disconnect*, to make intentional node shutdown/drain less disruptive for
> clients (link:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406619103
> ).
>
> Today, intentional node shutdown (e.g., rolling restarts) can still be
> disruptive from a client perspective. Drivers often ignore DOWN events
> because they are not reliable, and outstanding requests can end up as
> client-facing TimeOut exceptions.
>
> The proposed solution is to add an in-band GRACEFUL_DISCONNECT event that
> both control and query connections can opt into via REGISTER. When a node
> is shutting down, it will emit the event to all subscribed connections.
> Drivers will stop sending new queries on that connection/host, allow
> in-flight requests to finish, then reconnect with exponential backoff.
>
> If you have thoughts on the proposed protocol, server shutdown behavior,
> driver expectations, edge cases, or general feedback, I’d really appreciate
> it.
>
> Regards,
> Jane
>
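
P.S. To make the server-side fallback and the proposed config knobs a bit more concrete, here is a rough Python sketch of one possible reading. Everything in it is an assumption for discussion, not something the CEP defines: the DrainConfig and Connection types, the drain_action function, the default values, and the exact meaning given to the grace period and max drain settings are all hypothetical. The idea is that the server decides per connection, so a driver that ignores GRACEFUL_DISCONNECT still gets bounded, predictable behavior.

```python
from dataclasses import dataclass


@dataclass
class DrainConfig:
    # Hypothetical knobs, named after the YAML options suggested in this email.
    # The defaults are placeholders, not proposed values.
    graceful_disconnect_enabled: bool = True
    graceful_disconnect_grace_period_ms: int = 5000
    graceful_disconnect_max_drain_ms: int = 30000


@dataclass
class Connection:
    in_flight: int  # outstanding requests on this connection right now


def drain_action(conn: Connection, cfg: DrainConfig, elapsed_ms: int) -> str:
    """Decide what to do with one connection, elapsed_ms after drain begins.

    Assumed semantics (illustrative only):
      - feature disabled: close right away
      - past max drain: force the close, even with requests in flight
      - idle and past the grace period: close cleanly
      - otherwise: keep waiting
    """
    if not cfg.graceful_disconnect_enabled:
        return "close"
    if elapsed_ms >= cfg.graceful_disconnect_max_drain_ms:
        return "force_close"  # would increment a forced_disconnects counter
    if conn.in_flight == 0 and elapsed_ms >= cfg.graceful_disconnect_grace_period_ms:
        return "close"
    return "wait"  # would be reflected in a connections_draining gauge
```

With these assumed semantics, an operator can reason about the worst case: no connection outlives graceful_disconnect_max_drain_ms, regardless of driver version.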
