[ https://issues.apache.org/jira/browse/CASSANDRA-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041786#comment-15041786 ]
Ariel Weisberg commented on CASSANDRA-10477:
--------------------------------------------

bq. We're kind of dodging the hint "overload" protection on the paxos path as we don't use sendToHintedEndpoints (which in particular makes the comment on commitPaxosLocal misleading as it suggests otherwise). I think the simplest solution is to move the overload test from sendToHintedEndpoints to some checkOverloaded() method and call that in commitPaxos too.

Which aspect of hint "overload" protection is missing? [I see it increments a counter which I thought was the signal upstream.|https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/service/StorageProxy.java#L976] Looking at it further, is it because it doesn't throw {{OverloadedException}}? So a better behavior would be to have the check and exception in a helper method and use that in commitPaxos() so that it can now throw {{OverloadedException}}?

I do wonder what unforeseen consequences making {{CAS}} capable of throwing {{OE}} is going to have that we haven't seen or tested before. Where this gets interesting is that the read path now throws {{OE}} where it didn't before, because apparently serial consistency reads can end up calling {{beginAndRepairPaxos}}. I need to take a close look at how we test this path to make sure it's going to behave well once exercised.

bq. In theory, we could still run into the problem of that ticket if OPTIMIZE_LOCAL_REQUESTS is false. And in fact, I believe this option is unsafe since at least CASSANDRA-4753 as we somewhat strongly assume writes to the localhost do not go through MessagingService. So I would suggest ditching that option. Not only is it unsafe, but it's not used anywhere by the code and it's hardcoded so you have to change the code and recompile to even use it (which means I doubt anyone has even tried it in a long long time). And if we end up needing it in the future, we'll have to figure out how to make it safe.

It's already removed from 2.2. Yeah, I don't think anyone uses it.

bq. Why is the added assertion in WriteCallbackInfo on 3.0 not using !shouldHint like in the 2.1 patch?

It's an oversight from merging.

> java.lang.AssertionError in StorageProxy.submitHint
> ---------------------------------------------------
>
> Key: CASSANDRA-10477
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10477
> Project: Cassandra
> Issue Type: Bug
> Components: Local Write-Read Paths
> Environment: CentOS 6, Oracle JVM 1.8.45
> Reporter: Severin Leonhardt
> Assignee: Ariel Weisberg
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>
> A few days after updating from 2.0.15 to 2.1.9 we have the following log entry on 2 of 5 machines:
> {noformat}
> ERROR [EXPIRING-MAP-REAPER:1] 2015-10-07 17:01:08,041 CassandraDaemon.java:223 - Exception in thread Thread[EXPIRING-MAP-REAPER:1,5,main]
> java.lang.AssertionError: /192.168.11.88
>     at org.apache.cassandra.service.StorageProxy.submitHint(StorageProxy.java:949) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at org.apache.cassandra.net.MessagingService$5.apply(MessagingService.java:383) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at org.apache.cassandra.net.MessagingService$5.apply(MessagingService.java:363) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at org.apache.cassandra.utils.ExpiringMap$1.run(ExpiringMap.java:98) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
> {noformat}
> 192.168.11.88 is the broadcast address of the local machine.
> When this is logged the read request latency of the whole cluster becomes very bad, from 6 ms/op to more than 100 ms/op according to OpsCenter. Clients get a lot of timeouts. We need to restart the affected Cassandra node to get back normal read latencies. It seems write latency is not affected.
> Disabling hinted handoff using {{nodetool disablehandoff}} only prevents the assert from being logged. At some point the read latency becomes bad again. Restarting the node where hinted handoff was disabled results in the read latency being better again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
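
For readers following the {{checkOverloaded()}} suggestion in the comment above, here is a minimal, self-contained sketch of the idea: the overload test lives in one helper, and both the regular write path and the paxos commit path call it, so either can throw. This is not the actual patch. The class {{OverloadCheckSketch}}, the constant {{MAX_HINTS_IN_PROGRESS}}, the string destinations, and the simplified threshold test are all assumptions made for illustration; only the method names {{checkOverloaded}}, {{sendToHintedEndpoints}} and {{commitPaxos}} come from the discussion.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified sketch; none of these types are Cassandra's real classes.
class OverloadCheckSketch {

    /** Stand-in for Cassandra's OverloadedException (assumed unchecked here for brevity). */
    static class OverloadedException extends RuntimeException {
        OverloadedException(String message) {
            super(message);
        }
    }

    /** Assumed cap on in-flight hints; the real limit is computed differently. */
    static final int MAX_HINTS_IN_PROGRESS = 2;

    /** Assumed global counter of hints currently being written. */
    static final AtomicInteger totalHintsInProgress = new AtomicInteger();

    /** Shared check: throws if more hints are in flight than the assumed cap allows. */
    static void checkOverloaded(String destination) {
        int inFlight = totalHintsInProgress.get();
        if (inFlight > MAX_HINTS_IN_PROGRESS) {
            throw new OverloadedException(
                "Too many in-flight hints: " + inFlight + " (destination " + destination + ")");
        }
    }

    /** Regular write path: runs the shared check before "sending" to each destination. */
    static int sendToHintedEndpoints(List<String> destinations) {
        int sent = 0;
        for (String destination : destinations) {
            checkOverloaded(destination);
            sent++; // a real implementation would send or hint the mutation here
        }
        return sent;
    }

    /** Paxos commit path: calls the same helper, so it can now throw as well. */
    static int commitPaxos(List<String> destinations) {
        int committed = 0;
        for (String destination : destinations) {
            checkOverloaded(destination);
            committed++; // a real implementation would send the commit here
        }
        return committed;
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("10.0.0.1", "10.0.0.2");
        System.out.println("committed to " + commitPaxos(replicas) + " replicas");

        // Simulate a backlog of hints above the assumed cap.
        totalHintsInProgress.set(MAX_HINTS_IN_PROGRESS + 1);
        try {
            commitPaxos(replicas);
        } catch (OverloadedException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the check factored out this way, any caller that reaches {{commitPaxos}} (including, per the comment, serial reads that go through {{beginAndRepairPaxos}}) has to be prepared for the exception, which is the testing concern raised above.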