Thanks for the CEP, Jane! Could you elaborate on why the current OverloadedException behavior is insufficient in 5.x (CASSANDRA-19534)? If a server instance can’t accept a new request, it signals that immediately to the client for retry against another host, and eventually the client should receive a NODE_DOWN event on the control connection. This does mean that clients need to have queries marked as idempotent in order for them to be retried, even though non-idempotent queries are also safe to retry when they fail in this way. In the past, we’ve discussed a new exception hierarchy that allows the server to indicate whether a non-idempotent query would be safe to retry.
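To make the retry decision concrete, here is a rough sketch of the client-side logic as I understand it. This is a simplified model, not actual driver code: `Query`, `Verdict`, `on_overloaded`, and the `server_says_not_executed` flag are all hypothetical names. The flag stands in for the exception hierarchy we discussed, which as far as I know does not exist in the protocol today.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    RETRY_NEXT_HOST = "retry_next_host"
    RETHROW = "rethrow"


@dataclass(frozen=True)
class Query:
    cql: str
    idempotent: bool = False  # set by the application, not inferred


def on_overloaded(query: Query, server_says_not_executed: bool = False) -> Verdict:
    """Decide what to do when a coordinator rejects a request as overloaded."""
    # Behavior described above: only queries marked idempotent are retried.
    if query.idempotent:
        return Verdict.RETRY_NEXT_HOST
    # Hypothetical extension: the server indicates the request never started
    # executing, so a retry is safe even without the idempotent flag.
    if server_says_not_executed:
        return Verdict.RETRY_NEXT_HOST
    # Otherwise the error is surfaced to the application.
    return Verdict.RETHROW
```

With the default `server_says_not_executed=False`, a non-idempotent write rejected this way is rethrown to the application even though retrying it would have been safe, which is the drawback in question.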
If we can address that drawback of OverloadedException for non-idempotent queries, are there any other drawbacks of the current approach? In your proposal, the server will still need to handle new requests issued from clients after GRACEFUL_DISCONNECT is sent, particularly if the EVENT is delayed or dropped. If those requests are going to get processed, we’ll either have to continue deferring shutdown, or interrupt them and trigger retries later into the client’s latency budget.
