[ https://issues.apache.org/jira/browse/CASSANDRA-13810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237327#comment-16237327 ]
Tom van der Woerdt commented on CASSANDRA-13810:
------------------------------------------------

It's still happening, but we (mostly) worked around it by lowering hinted_handoff_throttle_in_kb 100x. :)

> Overload because of hint pressure + MVs
> ---------------------------------------
>
>                 Key: CASSANDRA-13810
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13810
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Materialized Views
>            Reporter: Tom van der Woerdt
>            Priority: Major
>              Labels: materializedviews
>
> Cluster setup: 3 DCs, 20 Cassandra nodes each, all 3.0.14, with approx. 200GB
> data per machine. Many tables have MVs associated.
> During some maintenance we did a rolling restart of all nodes in the cluster.
> This caused a buildup of hints/batches, as expected. Most nodes came back
> just fine, except for two nodes.
> These two nodes came back with a loadavg of >100, and 'nodetool tpstats'
> showed a million (not exaggerating) MutationStage tasks per second(!). It was
> clear that these were mostly (all?) mutations coming from hints, as indicated
> by thousands of log entries per second in debug.log:
> {noformat}
> DEBUG [SharedPool-Worker-107] 2017-08-27 13:16:51,098 HintVerbHandler.java:95 - Failed to apply hint
> java.util.concurrent.CompletionException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[na:1.8.0_144]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[na:1.8.0_144]
>     at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:647) ~[na:1.8.0_144]
>     at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632) ~[na:1.8.0_144]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[na:1.8.0_144]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[na:1.8.0_144]
>     at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:481) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.db.Keyspace.lambda$applyInternal$0(Keyspace.java:495) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_144]
>     at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) ~[apache-cassandra-3.0.14.jar:3.0.14]
>     at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
> Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
>     ... 6 common frames omitted
> {noformat}
> After reading the relevant code, it seems that a hint is considered
> droppable. In the mutation path, when the table has an MV, the lock cannot
> be acquired, and the mutation is droppable, a WriteTimeoutException (WTE) is
> thrown immediately, without waiting until the timeout expires. This explains
> why Cassandra is able to process a million mutations per second without
> actually counting them as 'dropped' in the 'nodetool tpstats' output.
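
The fail-fast path described in the paragraph above can be modelled in a few lines. This is only a sketch of the behaviour as the report describes it, not Cassandra's source: the class `ViewLockModel`, its `apply` method, the `WriteTimeout` exception and the one-permit `Semaphore` standing in for the view lock are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Simplified model of the reported behaviour; NOT Cassandra source code.
// Every name in this class is hypothetical.
class ViewLockModel {
    static class WriteTimeout extends RuntimeException {
        WriteTimeout(String message) { super(message); }
    }

    // Assumption: a one-permit semaphore stands in for the MV lock.
    final Semaphore viewLock = new Semaphore(1);

    // droppable = true models a hint; false models a write that waits for the lock.
    void apply(boolean droppable, long timeoutMillis) {
        boolean locked;
        try {
            locked = droppable
                    ? viewLock.tryAcquire()                                    // no waiting at all
                    : viewLock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS); // wait up to the timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            locked = false;
        }
        if (!locked) {
            // Fails immediately for droppable mutations, so the sender can retry at once.
            throw new WriteTimeout("Operation timed out - received only 0 responses.");
        }
        try {
            // The base-table write and the view update would happen here.
        } finally {
            viewLock.release();
        }
    }

    public static void main(String[] args) {
        ViewLockModel model = new ViewLockModel();
        model.viewLock.acquireUninterruptibly(); // simulate a contended lock
        long start = System.nanoTime();
        try {
            model.apply(true, 2000);
        } catch (WriteTimeout e) {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println("hint failed after " + elapsedMs + " ms: " + e.getMessage());
        }
    }
}
```

In this model a contended hint costs almost nothing to reject, which is consistent with a node reporting a very high MutationStage rate while making no progress.
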
> I managed to recover the two nodes by stopping handoffs on all nodes in the
> cluster and reenabling them one at a time. It's likely that the hint/batchlog
> settings were sub-optimal on this cluster, but I think that the retry
> behavior(?) of hints should be improved as it's hard to express hint
> throughput in kb/s when the mutations can involve MVs.
> More data available upon request -- I'm not sure which bits are relevant and
> which aren't.
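
To illustrate why a kb/s throttle says little about mutation load, the sketch below converts a byte-based throttle into an upper bound on hints per second. The helper `hintsPerSecond` is hypothetical, the 512-byte average hint size is an assumption, and 1024 is used as the starting throttle on the assumption that the cluster ran the stock cassandra.yaml value; the report states neither number.

```java
// Back-of-the-envelope arithmetic only; all names and input values are assumptions.
class HintThrottleMath {
    /** Upper bound on hints per second that a byte-based throttle lets through. */
    static long hintsPerSecond(long throttleKb, long avgHintBytes) {
        if (avgHintBytes <= 0) {
            throw new IllegalArgumentException("avgHintBytes must be positive");
        }
        return (throttleKb * 1024L) / avgHintBytes;
    }

    public static void main(String[] args) {
        long throttleKb = 1024;  // assumed starting value of hinted_handoff_throttle_in_kb
        long avgHintBytes = 512; // assumed average serialized hint size

        System.out.println(hintsPerSecond(throttleKb, avgHintBytes));       // 2048
        System.out.println(hintsPerSecond(throttleKb / 100, avgHintBytes)); // 20, after lowering 100x
    }
}
```

The byte budget is the same whether or not a hint touches a table with MVs, so the cost per hint on the receiving node is not captured by the throttle at all.
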