[jira] [Commented] (CASSANDRA-1451) Shutting down a node cleanly still kills client requests when the node goes down
[ https://issues.apache.org/jira/browse/CASSANDRA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033695#comment-13033695 ]

Jonathan Ellis commented on CASSANDRA-1451:
---

Instead of trying to make this an integrated part of drain, if we had a manual "shut down gossip" JMX control, we could have a simple workflow of

1) shut down gossip
2) wait for everyone to mark the node-to-drain as down
3) drain

with no further changes required to internals, and no new exception types to introduce.

Shutting down a node cleanly still kills client requests when the node goes down
--

Key: CASSANDRA-1451
URL: https://issues.apache.org/jira/browse/CASSANDRA-1451
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: David King
Priority: Minor
Fix For: 1.0

Shutting down a node, even more cleanly through drain, still kills some requests with timeout exceptions. Ideally, operations would not be sent at all to nodes that are known to be shutting down, perhaps by shutting down gossip before starting the draining process. Other nodes will still need to have the phi convict threshold exceeded, but presumably that usually takes less time than the drain.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
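The three-step workflow in the comment above could be scripted roughly as follows. This is only a sketch: `stop_gossip`, `peers_that_see_node_up`, and `drain` are hypothetical callables supplied by the operator (for example, thin wrappers around whatever JMX or command-line controls the cluster exposes), not Cassandra APIs. The choice to drain anyway once the timeout expires is also an assumption.

```python
import time
from typing import Callable, Iterable


def drain_after_gossip_shutdown(
    stop_gossip: Callable[[], None],
    peers_that_see_node_up: Callable[[], Iterable[str]],
    drain: Callable[[], None],
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Run the three steps in order.

    Returns True if every peer had marked the node as down before the
    drain started, False if the wait timed out first. All three callables
    are hypothetical hooks, not real Cassandra calls.
    """
    # Step 1: stop gossiping so peers eventually convict this node.
    stop_gossip()

    # Step 2: poll until no peer still believes the node is up.
    deadline = clock() + timeout_s
    all_marked_down = False
    while True:
        if not list(peers_that_see_node_up()):
            all_marked_down = True
            break
        if clock() >= deadline:
            break
        sleep(poll_interval_s)

    # Step 3: drain. This sketch drains even after a timeout (an assumption).
    drain()
    return all_marked_down
```

The return value lets a calling script log or alert when the drain started while some peers still considered the node up, which is the case where client requests can still be lost.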
[jira] [Commented] (CASSANDRA-1451) Shutting down a node cleanly still kills client requests when the node goes down
[ https://issues.apache.org/jira/browse/CASSANDRA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033806#comment-13033806 ]

Gary Dusbabek commented on CASSANDRA-1451:
--

We do have such a control.
[jira] Commented: (CASSANDRA-1451) Shutting down a node cleanly still kills client requests when the node goes down
[ https://issues.apache.org/jira/browse/CASSANDRA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992806#comment-12992806 ]

Matthew F. Dennis commented on CASSANDRA-1451:
--

Now that CASSANDRA-1108 is in place, it seems like whenever we start draining we should:

1) Shut down the gossiper. This will prevent future messages from other nodes (once it propagates).
2) Immediately reply to any request with a TException("node shutting down"). This will allow clients to distinguish between "node is busy" and "node is not playing right now". It will also allow other nodes to continue instead of waiting for RPCTimeout.

Thoughts?

Shutting down a node cleanly still kills client requests when the node goes down
--

Key: CASSANDRA-1451
URL: https://issues.apache.org/jira/browse/CASSANDRA-1451
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 0.6.5
Reporter: David King
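The two steps proposed above can be illustrated with a toy model. This is a sketch, not Cassandra code: `ToyNode`, `NodeShuttingDown`, and the `stop_gossip` hook are hypothetical names invented for the illustration. The point is only that a draining node rejects requests immediately with a distinct error instead of letting them run into a timeout.

```python
from typing import Callable, Dict


class NodeShuttingDown(Exception):
    """Hypothetical error returned immediately by a draining node."""


class ToyNode:
    """Toy stand-in for a node; not Cassandra's StorageService."""

    def __init__(self, stop_gossip: Callable[[], None]) -> None:
        self._stop_gossip = stop_gossip
        self._draining = False
        self.store: Dict[str, str] = {}

    def start_drain(self) -> None:
        # Step 1: stop gossiping so peers stop routing requests here
        # once the change has propagated.
        self._stop_gossip()
        # Step 2: from now on, fail fast rather than time out.
        self._draining = True

    def handle_write(self, key: str, value: str) -> None:
        if self._draining:
            # Rejected at once, so the caller does not wait for an RPC timeout.
            raise NodeShuttingDown("node shutting down")
        self.store[key] = value
```

A caller that catches `NodeShuttingDown` knows the node is going away and can move to another node at once, while a timeout can still be treated as "busy, maybe retry".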
[jira] Commented: (CASSANDRA-1451) Shutting down a node cleanly still kills client requests when the node goes down
[ https://issues.apache.org/jira/browse/CASSANDRA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934831#action_12934831 ]

Gary Dusbabek commented on CASSANDRA-1451:
--

An approach:

More cleanly: introduce a gossip state in conjunction with ApplicationState.STATUS that basically proclaims "I'm up, but stop routing requests to me" (e.g., see StorageService.startLeaving()). But then you'd be at the mercy of when that information makes it to every node. We might already have such a state, but it doesn't imply these semantics.

Even more cleanly: when a node is in that state and it receives a request from another node that doesn't know it, have it send a message that politely explains the situation and asks the sender to stop sending requests. Ideally, this would be done by forcing a gossip to the node that doesn't know the leaving node doesn't want requests (as opposed to creating a new message, verb handler, etc.).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
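Both ideas in the comment above can be modeled in a few lines. The sketch below is a toy, not Cassandra's gossiper: `Status.DRAINING`, `ToyPeer`, and its methods are hypothetical names. It shows a coordinator skipping endpoints it knows to be draining, and a draining node pushing its state directly to a sender that has not yet heard about it.

```python
from enum import Enum
from typing import Dict, List


class Status(Enum):
    NORMAL = "normal"
    DRAINING = "draining"  # hypothetical "up, but stop routing to me" state


class ToyPeer:
    """Toy node holding its own status and its view of other nodes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.status = Status.NORMAL
        # What this node currently believes about the others.
        self.view: Dict[str, Status] = {}

    def routable_endpoints(self, candidates: List[str]) -> List[str]:
        # Nodes with no recorded state are assumed to be NORMAL.
        return [
            n for n in candidates
            if self.view.get(n, Status.NORMAL) is Status.NORMAL
        ]

    def receive_request(self, sender: "ToyPeer") -> bool:
        """Return True if the request was accepted."""
        if self.status is Status.DRAINING:
            # "Forced gossip": update the unaware sender's view directly
            # instead of waiting for normal propagation.
            sender.view[self.name] = self.status
            return False
        return True
```

In this model the first request from an unaware peer is still refused, but every later routing decision by that peer skips the draining node.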
[jira] Commented: (CASSANDRA-1451) Shutting down a node cleanly still kills client requests when the node goes down
[ https://issues.apache.org/jira/browse/CASSANDRA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934833#action_12934833 ]

Jonathan Ellis commented on CASSANDRA-1451:
---

bq. When a node is in that state and it receives a request from another node that doesn't know it, have send a message that politely explains the situation and please stop sending me requests.

you're basically guaranteed to get these from every node in the cluster b/c of gossip delay, if it's under load. i'd say let's use gossip and accept there will be some delay.
[jira] Commented: (CASSANDRA-1451) Shutting down a node cleanly still kills client requests when the node goes down
[ https://issues.apache.org/jira/browse/CASSANDRA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934646#action_12934646 ]

Erik Onnen commented on CASSANDRA-1451:
---

Here's what we observed that led to this being discussed in IRC. When executing nodetool drain, a node is no longer able to accept new write operations. This is problematic for several reasons in the current implementation:

1) The drain node actually accepts writes. It won't process them locally, but it will ship writes to remote endpoints. In 0.6.8, the write can actually be successful even though a timeout error is reported back to the client when the local write fails, causing the client to think the write failed when it in fact succeeded.

2) The drain node can still process some writes, just not writes for which it is a natural endpoint. This leads to non-deterministic behavior for clients, where some writes succeed but others fail.

3) The drain node can still process reads. This causes some upstream client libraries to think the node is healthy when in reality it should be shunned (at least for writes).

4) When a local write is rejected, it surfaces as a timeout exception. This is the same behavior that happens when pending read/write stage operations are full. In many cases it's proper for a client to retry when the read/write stages are full, but because of how this appears to the client, the client cannot distinguish whether reads/writes are backed up or the local node is simply rejecting the write because it is draining. The clients can't self-help in this case; they're left to guess, which is bad.
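Point 4 above is about what a client could do if the two failure modes were distinguishable. The following sketch assumes they are: `TimedOut` and `NodeShuttingDown` are hypothetical exception types, and `send` is a hypothetical callable that performs one write attempt against one node. It retries a busy node a bounded number of times, but shuns a draining node immediately.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TimedOut(Exception):
    """Hypothetical: the node is busy or slow; a retry may succeed."""


class NodeShuttingDown(Exception):
    """Hypothetical: the node is draining; retrying it is pointless."""


def write_with_failover(
    nodes: Sequence[str],
    send: Callable[[str], T],
    max_retries_per_node: int = 2,
) -> T:
    """Try each node in order and return the first successful result."""
    for node in nodes:
        for _attempt in range(max_retries_per_node + 1):
            try:
                return send(node)
            except TimedOut:
                # Possibly a backed-up stage: try the same node again.
                continue
            except NodeShuttingDown:
                # Shun this node and move on to the next one.
                break
    raise RuntimeError("no node accepted the write")
```

Note that retrying after a timeout is only safe for writes that can be applied more than once, since, as point 1 describes, a write reported as timed out may in fact have succeeded.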
[jira] Commented: (CASSANDRA-1451) Shutting down a node cleanly still kills client requests when the node goes down
[ https://issues.apache.org/jira/browse/CASSANDRA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934652#action_12934652 ]

Erik Onnen commented on CASSANDRA-1451:
---

I'm happy to work on a fix for this if someone can point me in the right direction for getting started.