[
https://issues.apache.org/jira/browse/CASSANDRA-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041786#comment-15041786
]
Ariel Weisberg commented on CASSANDRA-10477:
--------------------------------------------
bq. We're kind of dodging the hint "overload" protection on the paxos path as
we don't use sendToHintedEndpoints (which in particular makes the comment on
commitPaxosLocal misleading as it suggests otherwise). I think the simplest
solution is to move the overload test from sendToHintedEndpoints to some
checkOverloaded() method and call that in commitPaxos too.
Which aspect of hint "overload" protection is missing? [I see it increments a
counter which I thought was the signal
upstream.|https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/service/StorageProxy.java#L976]
Looking at it further, is it because it doesn't throw {{OverloadedException}}?
So a better behavior would be to put the check and the exception in a helper
method and call that from commitPaxos(), so that it can now throw
{{OverloadedException}}?
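A minimal sketch of the helper-method idea, under stated assumptions: the class, the counters, the string destinations, and the threshold below are hypothetical stand-ins and not the actual StorageProxy code. Only the names checkOverloaded, sendToHintedEndpoints, commitPaxos, and OverloadedException come from the discussion above. It shows the overload test living in one place so that both write paths can throw.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for Cassandra's OverloadedException.
class OverloadedException extends RuntimeException {
    OverloadedException(String message) { super(message); }
}

// Illustrative only: counters and threshold are assumptions, not StorageProxy internals.
class HintOverloadSketch {
    private final int maxHintsInProgress;
    private final AtomicInteger totalHintsInProgress = new AtomicInteger();
    private final ConcurrentHashMap<String, AtomicInteger> hintsInProgress = new ConcurrentHashMap<>();

    HintOverloadSketch(int maxHintsInProgress) {
        this.maxHintsInProgress = maxHintsInProgress;
    }

    // Record that a hint for the given destination is now in flight.
    void hintStarted(String destination) {
        totalHintsInProgress.incrementAndGet();
        hintsInProgress.computeIfAbsent(destination, d -> new AtomicInteger()).incrementAndGet();
    }

    // The shared check: throw if too many hints are in flight overall
    // and this destination already has at least one of them.
    void checkOverloaded(String destination) {
        AtomicInteger forDestination = hintsInProgress.get(destination);
        if (totalHintsInProgress.get() > maxHintsInProgress
            && forDestination != null && forDestination.get() > 0) {
            throw new OverloadedException("Too many in flight hints: " + totalHintsInProgress.get());
        }
    }

    // Regular write path: check first, then send (sending is elided here).
    void sendToHintedEndpoints(String destination) {
        checkOverloaded(destination);
    }

    // Paxos commit path: the same check, so it can now throw as well.
    void commitPaxos(String destination) {
        checkOverloaded(destination);
    }
}
```

With a limit of 1 and two hints in flight for one destination, both sendToHintedEndpoints and commitPaxos throw for that destination in this sketch, while a destination with no hints in flight still passes.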
I do wonder what unforeseen consequences will follow from {{CAS}} being
capable of throwing {{OE}}, since that is something we haven't seen or tested
before. Where this gets interesting is that the read path can now throw {{OE}}
where it couldn't before, because apparently serial consistency reads can end
up calling {{beginAndRepairPaxos}}. I need to take a close look at how we test
this path to make sure it's going to behave well once exercised.
bq. In theory, we could still run into the problem of that ticket if
OPTIMIZE_LOCAL_REQUESTS is false. And in fact, I believe this option is unsafe
since at least CASSANDRA-4753 as we somewhat strongly assume writes to the
localhost do not go through MessagingService. So I would suggest ditching that
option. Not only is it unsafe, but it's not used anywhere by the code and it's
hardcoded so you have to change the code and recompile to even use it (which
means I doubt anyone has even tried it in a long long time). And if we end up
needing it in the future, we'll have to figure out how to make it safe.
It's already removed from 2.2. Yeah, I don't think anyone uses it.
bq. Why isn't the added assertion in WriteCallbackInfo on 3.0 using
!shouldHint like in the 2.1 patch?
It's an oversight from merging.
> java.lang.AssertionError in StorageProxy.submitHint
> ---------------------------------------------------
>
> Key: CASSANDRA-10477
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10477
> Project: Cassandra
> Issue Type: Bug
> Components: Local Write-Read Paths
> Environment: CentOS 6, Oracle JVM 1.8.45
> Reporter: Severin Leonhardt
> Assignee: Ariel Weisberg
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>
> A few days after updating from 2.0.15 to 2.1.9 we have the following log
> entry on 2 of 5 machines:
> {noformat}
> ERROR [EXPIRING-MAP-REAPER:1] 2015-10-07 17:01:08,041 CassandraDaemon.java:223 - Exception in thread Thread[EXPIRING-MAP-REAPER:1,5,main]
> java.lang.AssertionError: /192.168.11.88
>     at org.apache.cassandra.service.StorageProxy.submitHint(StorageProxy.java:949) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at org.apache.cassandra.net.MessagingService$5.apply(MessagingService.java:383) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at org.apache.cassandra.net.MessagingService$5.apply(MessagingService.java:363) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at org.apache.cassandra.utils.ExpiringMap$1.run(ExpiringMap.java:98) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) ~[apache-cassandra-2.1.9.jar:2.1.9]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
> {noformat}
> 192.168.11.88 is the broadcast address of the local machine.
> When this is logged the read request latency of the whole cluster becomes
> very bad, from 6 ms/op to more than 100 ms/op according to OpsCenter. Clients
> get a lot of timeouts. We need to restart the affected Cassandra node to get
> back normal read latencies. It seems write latency is not affected.
> Disabling hinted handoff using {{nodetool disablehandoff}} only prevents the
> assert from being logged. At some point the read latency becomes bad again.
> Restarting the node where hinted handoff was disabled results in the read
> latency being better again.