[ 
https://issues.apache.org/jira/browse/CASSANDRA-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041786#comment-15041786
 ] 

Ariel Weisberg commented on CASSANDRA-10477:
--------------------------------------------

bq. We're kind of dodging the hint "overload" protection on the paxos path as 
we don't use sendToHintedEndpoints (which in particular makes the comment on 
commitPaxosLocal misleading as it suggests otherwise). I think the simplest 
solution is to move the overload test from sendToHintedEndpoints to some 
checkOverloaded() method and call that in commitPaxos too.
Which aspect of hint "overload" protection is missing? [I see it increments a 
counter which I thought was the signal 
upstream.|https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/service/StorageProxy.java#L976]

Looking at it further, is it because it doesn't throw {{OverloadedException}}? 
So a better behavior would be to move the check and the exception into a helper 
method and call that from commitPaxos(), so that it can now throw 
{{OverloadedException}}?
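
To make that concrete, here is a rough, self-contained sketch of the shape I have in mind. It is not the actual StorageProxy code: {{HintBackpressure}}, {{hintStarted}}, the String endpoints and the stand-in {{OverloadedException}} class are all hypothetical, and the exact overload condition is a simplified assumption. The only point is that both paths go through the same checkOverloaded() helper.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for Cassandra's OverloadedException (not the real class).
class OverloadedException extends RuntimeException {
    OverloadedException(String message) { super(message); }
}

// Hypothetical sketch: tracks in-flight hints and exposes one shared overload check.
class HintBackpressure {
    private final int maxHintsInProgress; // assumed global threshold
    private final AtomicInteger totalHintsInProgress = new AtomicInteger();
    private final ConcurrentHashMap<String, AtomicInteger> hintsInProgressFor =
            new ConcurrentHashMap<>();

    HintBackpressure(int maxHintsInProgress) {
        this.maxHintsInProgress = maxHintsInProgress;
    }

    // Record that a hint for this endpoint has been submitted.
    void hintStarted(String endpoint) {
        totalHintsInProgress.incrementAndGet();
        hintsInProgressFor.computeIfAbsent(endpoint, e -> new AtomicInteger())
                          .incrementAndGet();
    }

    // The check pulled out of the write path so other callers can reuse it.
    // Assumed condition: too many hints overall AND this endpoint has some in flight.
    void checkOverloaded(String endpoint) {
        AtomicInteger perEndpoint = hintsInProgressFor.get(endpoint);
        boolean endpointHasHints = perEndpoint != null && perEndpoint.get() > 0;
        int total = totalHintsInProgress.get();
        if (total > maxHintsInProgress && endpointHasHints)
            throw new OverloadedException("Too many in flight hints: " + total);
    }

    // Regular write path: check every target before sending (sending omitted).
    void sendToHintedEndpoints(Iterable<String> endpoints) {
        for (String endpoint : endpoints)
            checkOverloaded(endpoint);
    }

    // Paxos commit path: same check, so it can now throw OverloadedException too.
    void commitPaxos(Iterable<String> endpoints) {
        for (String endpoint : endpoints)
            checkOverloaded(endpoint);
    }
}
```

With the check shared like this, anything that reaches commitPaxos() gets the same back-pressure behavior as a normal write, which is also why callers of that path start seeing {{OverloadedException}}.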

I do wonder what unforeseen consequences we'll hit once {{CAS}} is capable of 
throwing {{OE}}, since that's something we haven't seen or tested before. Where 
this gets interesting is that the read path would now throw {{OE}} where it 
didn't before, because serial consistency reads can apparently end up calling 
{{beginAndRepairPaxos}}. I need to take a close look at how we test this path 
to make sure it behaves well once exercised.

bq. In theory, we could still run into the problem of that ticket if 
OPTIMIZE_LOCAL_REQUESTS is false. And in fact, I believe this option is unsafe 
since at least CASSANDRA-4753 as we somewhat strongly assume writes to the 
localhost do not go through MessagingService. So I would suggest ditching that 
option. Not only is it unsafe, but it's not used anywhere by the code and it's 
hardcoded so you have to change the code and recompile to even use it (which 
means I doubt anyone has even tried it in a long long time). And if we end up 
needing it in the future, we'll have to figure out how to make it safe.
It's already removed from 2.2. Yeah, I don't think anyone uses it.

bq. Why isn't the added assertion in WriteCallbackInfo on 3.0 using 
!shouldHint like in the 2.1 patch?
It's an oversight from merging.

> java.lang.AssertionError in StorageProxy.submitHint
> ---------------------------------------------------
>
>                 Key: CASSANDRA-10477
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10477
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>         Environment: CentOS 6, Oracle JVM 1.8.45
>            Reporter: Severin Leonhardt
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>
> A few days after updating from 2.0.15 to 2.1.9 we have the following log 
> entry on 2 of 5 machines:
> {noformat}
> ERROR [EXPIRING-MAP-REAPER:1] 2015-10-07 17:01:08,041 CassandraDaemon.java:223 - Exception in thread Thread[EXPIRING-MAP-REAPER:1,5,main]
> java.lang.AssertionError: /192.168.11.88
>         at org.apache.cassandra.service.StorageProxy.submitHint(StorageProxy.java:949) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.net.MessagingService$5.apply(MessagingService.java:383) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.net.MessagingService$5.apply(MessagingService.java:363) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.utils.ExpiringMap$1.run(ExpiringMap.java:98) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
> {noformat}
> 192.168.11.88 is the broadcast address of the local machine.
> When this is logged the read request latency of the whole cluster becomes 
> very bad, from 6 ms/op to more than 100 ms/op according to OpsCenter. Clients 
> get a lot of timeouts. We need to restart the affected Cassandra node to get 
> back normal read latencies. It seems write latency is not affected.
> Disabling hinted handoff using {{nodetool disablehandoff}} only prevents the 
> assert from being logged. At some point the read latency becomes bad again. 
> Restarting the node where hinted handoff was disabled results in the read 
> latency being better again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
