[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094684#comment-13094684 ] Jonathan Ellis commented on CASSANDRA-2034: --- what messages is RemoveTest waiting for that are causing the problem? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, 2034-v16.txt, 2034-v17.txt, 2034-v18.txt, 2034-v19-rebased.txt, 2034-v19.txt, 2034-v20.txt, 2034-v21.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v14.patch, CASSANDRA-2034-trunk-v15.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094699#comment-13094699 ] Patricio Echague commented on CASSANDRA-2034: - I fixed the test. If I'm not wrong, {code} for (InetAddress host : hosts) { Message msg = new Message(host, StorageService.Verb.REPLICATION_FINISHED, new byte[0], MessagingService.version_); MessagingService.instance().sendRR(msg, FBUtilities.getBroadcastAddress()); } {code} sends 5 messages but some of them (4) are caught by a local SinkManager implementation(for testing purposes) and not processed. And since you added a wait for the callbacks to be processed before exiting, I had to add a force shutdown (for testing purposes) in order to make the test complete successfully. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, 2034-v16.txt, 2034-v17.txt, 2034-v18.txt, 2034-v19-rebased.txt, 2034-v19.txt, 2034-v20.txt, 2034-v21.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v14.patch, CASSANDRA-2034-trunk-v15.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089922#comment-13089922 ] Patricio Echague commented on CASSANDRA-2034: - Jonathan, I noticed you modified a bit CallbackInfo.shoudHint() {code} public boolean shouldHint() { return message != null StorageProxy.shouldHint(target); } {code} Not sure if you meant to say that your changes addresses the issue of not hinting when CL is not reached. The new shoudHint method you added should be ok as it is processed upon RPCTimeout disregard if the CL was achieved or not. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, 2034-v16.txt, 2034-v17.txt, 2034-v18.txt, 2034-v19.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v14.patch, CASSANDRA-2034-trunk-v15.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089943#comment-13089943 ] Jonathan Ellis commented on CASSANDRA-2034: --- Right, that's what I was referring to when I said [the old shouldHint] does not achieve our goal of making read-repair unnecessary. For that, we need to always hint when an attempted write fails. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, 2034-v16.txt, 2034-v17.txt, 2034-v18.txt, 2034-v19.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v14.patch, CASSANDRA-2034-trunk-v15.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13085858#comment-13085858 ] Patricio Echague commented on CASSANDRA-2034: - Thanks Jonathan for the snippet of code. I didn't notice it was broken. I don't see where CallbackInfo.shouldHint is broken. {code} public boolean shouldHint() { if (StorageProxy.shouldHint(target) isMutation) { try { 1) ((IWriteResponseHandler) callback).get(); return true; } catch (TimeoutException e) { // CL was not achieved. We should not hint. } } return false; } {code} I process the callback after the message expired. If the CL was achieved (and the requirement for a hint are gathered) I return true for this target meaning that a hint needs to be written. On the other hand, if the message expire and the CL was not achieved, then I return FALSE (for this target). Perhaps it needs a special treatment during the shutdown ? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v14.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13085966#comment-13085966 ] Patricio Echague commented on CASSANDRA-2034: - {quote} currentHintsQueueSize [now totalHints] increment needs to be done OUTSIDE the runnable or it will never get above the number of task executors {quote} Interesting. I must have forgotten it after one of the patches. I remember fixing it before. {quote} Yes. We should probably either wait for the messages to time out (which is mildly annoying to the user) or just write hints for everything (which may be confusing: why are there hints being sent after I restart, when no node was ever down?) I don't see a perfect solution. {quote} I think I prefer make the user wait for RPCTimeout since it is not that much and perhaps puts a bit more clarity than just saving the hints just in case. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, 2034-v16.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v14.patch, CASSANDRA-2034-trunk-v15.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13085978#comment-13085978 ] Patricio Echague commented on CASSANDRA-2034: - {quote} Also, still need to address this: currentHintsQueueSize [now totalHints] increment needs to be done OUTSIDE the runnable or it will never get above the number of task executors {quote} hintsInProgress.incrementAndGet(); happens outside of the executor and actually before scheduling it. totalHints.incrementAndGet(); on the other hand, the totalHint is incremented right after the hint was written and within the task. Is that not right ? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, 2034-v16.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v14.patch, CASSANDRA-2034-trunk-v15.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13085142#comment-13085142 ] Patricio Echague commented on CASSANDRA-2034: - I can do that. Also, I'm not quite happy with waiting for local hints to complete per mutation. I'm thinking of adding them to the handler so that we can wait for the hints after scheduling all the mutations. It has pros and cons: Pros: If the coordinator node is overwhelmed, we can tell the client right away. Cons: Por large mutations, we are actually blocking for local hints (if any) per mutation which is not ideal. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13085151#comment-13085151 ] Jonathan Ellis commented on CASSANDRA-2034: --- Yes, that should be in mutate() or the handler so the waiting is parallelized. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13085497#comment-13085497 ] Jonathan Ellis commented on CASSANDRA-2034: --- Hmm. I think you're right: it would work better to do the hints in the handler instead of passing these lists around. Sorry; let's change it to do it that way. Other notes: SP.shouldHint is broken (will always return true when hints are disabled). I would write it like this: {code} public static boolean shouldHint(InetAddress ep) { if (!isHintedHandoffEnabled()) return false; boolean hintWindowExpired = Gossiper.instance.getEndpointDowntime(ep) maxHintWindow; if (hintWindowExpired) logger.debug(not hinting {} which has been down {}ms, ep, Gossiper.instance.getEndpointDowntime(ep)); return !hintWindowExpired; } {code} CallbackInfo.shouldHint is broken a different way. It should be returning true if and only if the write to the target failed. (Calling this variable from is odd -- from is used to refer to localhost in a MessageService context.) Currently, it returns true if the overall CL is achieved, which in the general case tells us nothing about the individual replica in question. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v14.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13084779#comment-13084779 ] Jonathan Ellis commented on CASSANDRA-2034: --- I don't think keeping passing a list of unavailableEndpoints everywhere is actually necessary. I may be missing a use case, but what I see is - in sendToHintedEndpoints - in assureSufficientLiveNodes implementation Both of which can be replaced in a straightforward manner with FailureDetector calls. (Note that it is not necessary for FD state to remain unchanged between assureSufficient and sending.) In fact using the same list in both places is a bug: assureSufficient only cares about what FD thinks, so mixing hinted-handoff-enabledness in as getUnavailableEndpoints does will cause assureSufficient to return false positives w/ HH off. So I'd make assureSufficient use FD directly, and sendTHE use FD + HH state. Bonus: no List allocation in the common case of everything is healthy. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v12.patch, CASSANDRA-2034-trunk-v13.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13084154#comment-13084154 ] Jonathan Ellis commented on CASSANDRA-2034: --- - there's an unused overload of MS.addCallback - shouldHint should probably be a CallbackInfo method - the .warn in scheduleMH should be .error - any reason scheduleMH is protected instead of private? - technically MS.shutdown should probably collect the Futures of the hints being written and wait on them in {code} if (consistencyLevel != null consistencyLevel == ConsistencyLevel.ANY) {code} did you mean this? {code} if (responseHandler != null consistencyLevel == ConsistencyLevel.ANY) {code} - not a big fan of the subclass just exposes a different constructor than its parent idiom. I think the normal expectation is that a subclass encapsulates some kind of different behavior than its parent. I'd get rid of CIWM and just expose a with-Message constructor in CallbackInfo. - Would also make CI message and isMutation final - avoid declaring @Override on abstract methods (looking at writeLocalHint Runnable, also CTAF, may be others) - avoid comments that repeat what the code says, like // One more task in the hints queue. - would name hintCounter - totalHints (I had to look at usages to figure out what it does) - there's no actual hint queue anymore so would name currentHintsQueueSize - hintsInProgress (similarly, maxHintsQueueSize) - prefer camelCase variable names (CTAF overall_timeout) - unit.toMillis(timeout) would be more idiomatic than TimeUnit.MILLISECONDS.convert(timeout, unit) - otherwise CTAF is a good clean encapsulation, nice job - generics is upset about hintsFutures.add(new CreationTimeAwareFuture(hintfuture)) -- can you fix the unchecked warning there? - unnecessary whitespace added to the method below // wait for writes. throws TimeoutException if necessary - would prefer to avoid allocating the hintFutures list unless we actually need to write hints, since this is on the write inner loop - still think we can simplify sendToHinted by getting rid of getHintedEndpoints and operating directly on the raw writeEndpoints (and consulting FD to decide whether to write a hint) - currentHintsQueueSize increment needs to be done OUTSIDE the runnable or it will never get above the number of task executors - it would be nice if we could tell the CallbackInfo whether the original write reached ConsistencyLevel. Maybe we could do this by changing isMutation to volatile isHintRequired, something like that. (And just set it to false if CL is not reached.) If not, we don't have to write hints for timed out replicas -- this will help avoiding OOM if nodes in the cluster start getting overloaded and dropping writes. Otherwise, the coordinator could run out of memory queuing up hints for writes that timed out and the client will retry. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13084164#comment-13084164 ] Jonathan Ellis commented on CASSANDRA-2034: --- - the timeout-based waitOnFutures overload should only accept CTAF objects since it will NOT behave as expected with others, e.g., FutureTask objects Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083486#comment-13083486 ] Jonathan Ellis commented on CASSANDRA-2034: --- I think v11 is missing the new Callback classes. Can you rebase to trunk when you add those? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13082492#comment-13082492 ] Jonathan Ellis commented on CASSANDRA-2034: --- bq. I don't think local hints need to be put on their own queue / thread-pool. Just write the hint to the local mutation queue and increment the hint counters Agreed, that is a better solution than complicating things with extra queues / executors. Since the message dropping is done by the Task, not the executor, there's no problem in that respect. And we can use a simple counter to avoid clogging the queue with hints. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080979#comment-13080979 ] T Jake Luciani commented on CASSANDRA-2034: --- Comments on v8: - You are still waiting on hints in the client, I don't think we need CreationTimeAwareFuture anymore. - I don't think local hints needs to be put on it's own queue / thread-pool. Just write the hint to the local mutation queue and increment the hint counters. - The shutdown process should not wait for the mutation map to expire, it should simply write any outstanding hints from the map and shutdown. Nitpik: - The new expiring map logic is backwards in my opinion. I would rather see the expiration callback handle the hint write, then see the MessageService call storageproxy. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13079936#comment-13079936 ] Lior Golan commented on CASSANDRA-2034: --- Writes are fast, but you also have network latency for communicating with all the nodes. What would happen in the worst-of-N case for multi-datacenter deployments? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080115#comment-13080115 ] Patricio Echague commented on CASSANDRA-2034: - {quote} return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets {quote} The way I'm planning on implementing this last part is by adding a maybeWriteHint in the hook that already exist for the ExpiringMap in MessagingService. The only problem is that I don't have the mutation as to generate a hint in MessageService. I might catch what nodes hasn't replied in the StorageProxy and store one mutation per node that hasn't replied into an ExpiringMap in MessagingService. The down side is that for CL.ONE I will end up storing basically one mutation per replica minus the one that responded, and it may lead to memory issues. Thoughts? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080129#comment-13080129 ] T Jake Luciani commented on CASSANDRA-2034: --- I think you can reuse the same RowMutation object across messages so shouldn't cause duplicates in memory. The only issue is if too many mutations queue up you might OOM but this is the same problem we currently we have with the write stage. So if you use a Expiring map you should add a onExpiration hook to write the hint locally for the replicas that never responded to the Mutation. This covers the case that mutations expire before a response is received. Then all the MessageTask needs todo is clear the messageId from the expiring map before the expiration time. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080137#comment-13080137 ] Jonathan Ellis commented on CASSANDRA-2034: --- bq. The only issue is if too many mutations queue up you might OOM but this is the same problem we currently we have with the write stage Right. Either way worst case is already you need to be able to buffer up to rpc_timeout's worth of writes in memory. bq. if you use a Expiring map you should add a onExpiration hook to write the hint locally for the replicas that never responded to the Mutation I'm not sure you want a separate ExpiringMap from the one you already have in MS. Might make for weird corner cases. But that's an implementation detail; the approach is sound. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080155#comment-13080155 ] Patricio Echague commented on CASSANDRA-2034: - The reason I need the extra map is to store the mutation. Otherwise I cannot generate a hint. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080203#comment-13080203 ] Patricio Echague commented on CASSANDRA-2034: - {quote} But, you are guaranteed that successful writes have been hinted (if necessary) so you do not have to repair unless there is hardware permadeath. (Otherwise you would have to repair after power failure or crashes, too.) {quote} Since we are queuing up a hint on failure/RPC timeout after acknowledging to the client, it looks like we need to rapair everytime a node is shutdown given the crash-only way to shutting down Cassandra. Since we can have task in the queue yet to be executed. Right? I don't think repair is need only on hardware permadeath. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080207#comment-13080207 ] T Jake Luciani commented on CASSANDRA-2034: --- There is a shutdownHook we added for these kinds of things. see StorageService Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080208#comment-13080208 ] Jonathan Ellis commented on CASSANDRA-2034: --- bq. I don't think repair is need only on hardware permadeath. It's right there in what you quoted: Otherwise [if you're not waiting for all replica acks before returning to client] you would have to repair after power failure or crashes, too. bq. it looks like we need to rapair everytime a node is shutdown given the crash-only way to shutting down Cassandra As discussed on chat, you need to add a shutdown hook to let the executor finish for non-crash shutdowns. Look for Runtime.getRuntime().addShutdownHook in StorageService. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080213#comment-13080213 ] Patricio Echague commented on CASSANDRA-2034: - Thanks. As per the TimeoutException trying to queue up a hint while writing hints after ACK timeout. Would it be ok to have a second executor with a non-capped queue? I'm currently using a capped queue for writing hints to replicas that are down before starting the mutation. Downside of this is that the queue can grow big (by default we don't schedule hints after one hour that a replica has been down) Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13079618#comment-13079618 ] Jonathan Ellis commented on CASSANDRA-2034: --- So I proposed two mutually exclusive approaches here: - return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets - add a separate executor here, with a blocking, capped queue. When we go to do a hint-after-failure we enqueue [and] wait for the write and then return success to the client The difference can be summarized as: do we wait for all hints to be written before returning to the client? If you do, then CL.ONE write latency becomes worst-of-N instead of best-of-N. But, you are guaranteed that successful writes have been hinted (if necessary) so you do not have to repair unless there is hardware permadeath. (Otherwise you would have to repair after power failure or crashes, too.) I'm inclined to think that the first option is better, partly because writes are *fast* so worst-of-N really isn't that different from best-of-N. Also, we could use CASSANDRA-2819 to reduce the default write timeout while still being conservative for reads (which might hit disk). Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078928#comment-13078928 ] Patricio Echague commented on CASSANDRA-2034: - This ticket assumes counter will not be written with CL == ANY (CASSANDRA-2990) Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078935#comment-13078935 ] Jonathan Ellis commented on CASSANDRA-2034: --- I think CreationTimeAwareFuture is missing? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13073791#comment-13073791 ] Jonathan Ellis commented on CASSANDRA-2034: --- Looks reasonable. I'd move the wait method into FBUtilities as an overload of waitOnFutures. Does the timeout on get start when the future is created, or when get is called? I think it is the latter but I am not sure. If so, we should track the total time waited and reduce after each get() so we do not allow total of up to timeout * hints. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13073807#comment-13073807 ] Patricio Echague commented on CASSANDRA-2034: - {quote} Does the timeout on get start when the future is created, or when get is called? I think it is the latter but I am not sure. If so, we should track the total time waited and reduce after each get() so we do not allow total of up to timeout * hints. {quote} Yeah, I need to add a creation time and so something similar to what IAsynResult does. I noticed I missed to skip the hints creation when HH is disabled. Some thoughts on this I would like some feedback: Note: remember that hints are written locally on the coordinator node now. | Hinted Handoff | Consist. Level | | on | =1 | -- wait for hints. We DO NOT notify the handler with handler.response() for hints; | on | ANY | -- wait for hints. Responses count towards consistency. | off| =1 | -- DO NOT fire hints. And DO NOT wait for them to complete. | off| ANY | -- Fire hints but don't wait for them. They count towards consistency. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13073812#comment-13073812 ] Jonathan Ellis commented on CASSANDRA-2034: --- Looks good to me for the first 3. I think ANY should be equal to ONE for hints=off. I.e., when it's off we *never* create hints. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13075977#comment-13075977 ] Patricio Echague commented on CASSANDRA-2034: - v2 patch replaces v1. Changes in v2: - It fixes the add-up timeouts by adding a CreationAwareFuture - Implements the matrix previously discussed | Hinted Handoff | Consist. Level | | on | =1 | -- wait for hints. We DO NOT notify the handler with handler.response() for hints; | on | ANY | -- wait for hints. Responses count towards consistency. | off| =1 | -- DO NOT fire hints. And DO NOT wait for them to complete. | off| ANY | -- DO NOT fire hints. And DO NOT wait for them to complete. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13069275#comment-13069275 ] Patricio Echague commented on CASSANDRA-2034: - bq. after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. CASSANDRA-2914 handles the local storage of hints. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067059#comment-13067059 ] Nicholas Telford commented on CASSANDRA-2034: - .bq Don't hints timeout? Yes, the timeout is set to the GCGraceSeconds of the CF the hint is for. This is to prevent deletes from being undone by an old hint being replayed. Post-#2045 hints contain the RM for multiple CFs; as such, the TTL for a hint is the minimum GCGraceSeconds from all of the CFs it references. This could cause hints to expire before delivery if one of the CFs for a mutation has a particularly short GCGraceSeconds. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067106#comment-13067106 ] Jonathan Ellis commented on CASSANDRA-2034: --- bq. the timeout is set to the GCGraceSeconds of the CF the hint is for That's right. So the guidance we can give is, you need to run repair if a node is down for longer than GCGraceSeconds, or if you lose data because of a hardware problem. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058044#comment-13058044 ] Patricio Echague commented on CASSANDRA-2034: - Depends on CASSANDRA-2045. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056910#comment-13056910 ] Jonathan Ellis commented on CASSANDRA-2034: --- bq. If we had different timeouts for writes than reads then it might be nice to use say 80% of the timeout for the normal write, and reserve 20% for the hint phase Different r/w timeouts is being added in CASSANDRA-2819. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13039181#comment-13039181 ] Jonathan Ellis commented on CASSANDRA-2034: --- Note that hint writes MUST be synchronous w/ the writes-as-percieved-by-client (which the design above accomplishes) or else you lose hints silently if a coordinator crashes (or is shut down/killed normally). Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Sylvain Lebresne Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033147#comment-13033147 ] T Jake Luciani commented on CASSANDRA-2034: --- Don't hints timeout? would there be a chance of never resolving the discrepancy with this approach? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033159#comment-13033159 ] Stu Hood commented on CASSANDRA-2034: - bq. Better would be to add a hook to messagingservice callback expiration, and fire hint recording from there bq. So I think what we want to do, with this option on, is to attempt the hint write but if we can't do it in a reasonable time... +1. An expiration handler on the messaging service queue, and throttled local hint writing should work very well. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033202#comment-13033202 ] Jonathan Ellis commented on CASSANDRA-2034: --- bq. Don't hints timeout? No (but cleanup can purge hinted rows, so don't do that unless all hints have been replayed). bq. would there be a chance of never resolving the discrepancy with this approach? Definitely in the case of a node went down, so I wrote some hints, but then the hinted node lost its hdd too. In that case you'd need to run AE repair. So the idea is not to make AE repair (completely) obsolete, only RR. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033273#comment-13033273 ] Stu Hood commented on CASSANDRA-2034: - IMO, disabling RR entirely is never a good idea unless we are going to _guarantee_ hint delivery. But I agree that this ticket is a good idea because increasing the probability of hint delivery is healthy. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032696#comment-13032696 ] Jonathan Ellis commented on CASSANDRA-2034: --- bq. there is the potential to take us back to the Bad Old Days when HH could cause cascading failure To elaborate, the scenario here is, we did a write that succeeded on some nodes, but not others. So we need to write a local hint to replay to the down-or-slow nodes later. But, those nodes being down-or-slow mean load has increased on the rest of the cluster, and writing the extra hint will increase that further, possibly enough that other nodes will see this coordinator as down-or-slow, too, and so on. So I think what we want to do, with this option on, is to attempt the hint write but if we can't do it in a reasonable time, throw back a TimedOutException which is already our signal that your cluster may be overloaded, you need to back off. Specifically, we could add a separate executor here, with a blocking, capped queue. When we go to do a hint-after-failure we enqueue the write but if it is rejected because queue is full we throw the TOE. Otherwise, we wait for the write and then return success to the client. The tricky part is the queue needs to be large enough to handle load spikes but small enough that wait-for-success-post-enqueue is negligible compared to RpcTimeout. If we had different timeouts for writes than reads (which we don't -- CASSANDRA-959) then it might be nice to use say 80% of the timeout for the normal write, and reserve 20% for the hint phase. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027151#comment-13027151 ] Jonathan Ellis commented on CASSANDRA-2034: --- bq. add a scheduled task this is the wrong approach, as we found out when we tried something similar for read repair, which we fixed in CASSANDRA-2069. Better would be to add a hook to messagingservice callback expiration, and fire hint recording from there if MS expires the callback before all acks are received. (We could refactor the dynamic snitch latency update into a similar hook for reads.) bq. This would need a separate executor for local writes that doesn't drop writes when it's behind I'm more worried about this; there is the potential to take us back to the Bad Old Days when HH could cause cascading failure. (Of course the right answer is, Don't run your cluster so close to the edge of capacity, but we still want to degrade gracefully when this is ignored.) Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Fix For: 1.0 Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12988173#action_12988173 ] Jonathan Ellis commented on CASSANDRA-2034: --- This would need a separate executor for local writes that doesn't drop writes when it's behind (and blocks when it's full) to avoid problems in overcapacity situations. I'm about halfway convinced that while blocking clients for writes to {any node in the cluster} is bad to control overload situations, blocking clients when the coordinator itself is overloaded is ok. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Fix For: 0.7.2 Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would making disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.