[
https://issues.apache.org/jira/browse/CASSANDRA-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Shuler resolved CASSANDRA-5218.
---------------------------------------
Resolution: Not a Problem
Closing as not a problem. We indeed switched to logback in 2.1.
> Log explosion when another cluster node is down and remaining node is
> overloaded.
> ---------------------------------------------------------------------------------
>
> Key: CASSANDRA-5218
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5218
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 1.1.7
> Reporter: Sergey Olefir
> Priority: Minor
>
> I have a Cassandra 1.1.7 cluster with 4 nodes in 2 datacenters (2+2).
> Replication is configured as DC1:2,DC2:2 (i.e. every node holds the entire
> data set).
> I am load-testing counter increments at a rate of about 10k per second. All
> writes are directed to the two nodes in DC1 (the DC2 nodes are basically backup).
> In total there are 100 separate clients executing 1-2 batch updates per second.
> We wanted to test what happens if one node goes down, so we brought one node
> down in DC1 (i.e. the node that was handling half of the incoming writes).
> This led to a complete explosion of logs on the remaining alive node in DC1.
> There are hundreds of megabytes of logs within an hour all basically saying
> the same thing:
> ERROR [ReplicateOnWriteStage:5653390] 2013-01-22 12:44:33,611 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[ReplicateOnWriteStage:5653390,5,main]
> java.lang.RuntimeException: java.util.concurrent.TimeoutException
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1275)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:311)
>         at org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.java:585)
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1271)
>         ... 3 more
> The logs are completely swamped with this and are thus unusable. It may also
> negatively impact the node performance.
> According to Aaron Morton:
> {quote}The error is the coordinator node protecting it's self.
> Basically it cannot handle the volume of local writes + the writes for HH.
> The number of in flight hints is greater than…
> private static volatile int maxHintsInProgress = 1024 *
> Runtime.getRuntime().availableProcessors();{quote}
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/node-down-log-explosion-tp7584932p7584957.html
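The overload behavior Aaron describes can be sketched as a simple counting guard; this is a hypothetical illustration of the idea, not the actual Cassandra 1.1 source (class and method names are invented, only the `1024 * availableProcessors()` formula comes from the quoted field):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of coordinator back-pressure on hinted handoff:
// once too many hints are in flight, further writes are shed (in the
// real code path this surfaces as the TimeoutException seen above).
public class HintBackPressure {
    // Same formula as the quoted field: 1024 hints per available core.
    static final int MAX_HINTS_IN_PROGRESS =
            1024 * Runtime.getRuntime().availableProcessors();

    private final AtomicInteger hintsInProgress = new AtomicInteger();

    /** Returns false (sheds the write) if the in-flight limit is exceeded. */
    public boolean tryAcquireHintSlot() {
        if (hintsInProgress.incrementAndGet() > MAX_HINTS_IN_PROGRESS) {
            hintsInProgress.decrementAndGet();
            return false; // overloaded: reject rather than queue unboundedly
        }
        return true;
    }

    /** Called when a hint has been written or delivered. */
    public void releaseHintSlot() {
        hintsInProgress.decrementAndGet();
    }

    public int inProgress() {
        return hintsInProgress.get();
    }
}
```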
> I think there are two issues here:
> (a) the same exception occurring for the same reason doesn't need to be
> spammed into the log many times per second;
> (b) the exception message ought to be clearer about the cause -- i.e. in this
> case some message about "overload" or "load shedding" might be appropriate.
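Suggestion (a) amounts to deduplicating repeated log messages. A minimal sketch of one way to do that, assuming a per-message-key throttle interval (names and interval are illustrative, not Cassandra code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative throttle: a sustained failure logs once per interval per
// message key instead of many times per second.
public class ThrottledLogger {
    private final long intervalMillis;
    private final Map<String, Long> lastLogged = new ConcurrentHashMap<>();

    public ThrottledLogger(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Returns true (and records the event) only if this message key has
     * not been logged within the last interval; callers skip logging
     * when it returns false.
     */
    public boolean shouldLog(String key, long nowMillis) {
        Long prev = lastLogged.get(key);
        if (prev != null && nowMillis - prev < intervalMillis) {
            return false; // suppress duplicate within the interval
        }
        lastLogged.put(key, nowMillis);
        return true;
    }
}
```

(Since 2.1 this kind of filtering can also be configured in logback itself, e.g. via a DuplicateMessageFilter, rather than in application code.)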
--
This message was sent by Atlassian JIRA
(v6.2#6252)