[ https://issues.apache.org/jira/browse/CASSANDRA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529034#comment-15529034 ]

Benjamin Roth commented on CASSANDRA-12689:
-------------------------------------------

I was able to track that bug down, prove it with a negative dtest, and 
prove my fix with a positive dtest. 
(https://gist.github.com/brstgt/339d20994828794c8f374bc987b7b6d7)

To be able to run those tests I had to apply some hacks 
(https://github.com/Jaumo/cassandra/commit/6b6806b9ba60c9b7111f00451aec4c6182199702)
 so that there is only a single mutation worker and so that acquiring MV locks 
fails in a deterministic order. The test is not deterministic with more than 
one worker, because race conditions then pop up.
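The single-worker hack matters because the deadlock is a pool-self-blocking pattern: a task on the pool waits on work that can only ever run on that same pool. The following minimal sketch is my own illustration of that pattern, not Cassandra code; the class name, the one-thread pool, and the "MV lock" framing are all hypothetical simplifications.

```java
import java.util.concurrent.*;

public class PoolSelfDeadlock {

    static String demo() throws Exception {
        // A single-worker pool, analogous to the dtest hack of running only
        // one mutation worker. (Illustration only, not Cassandra's MutationStage.)
        ExecutorService mutationStage = Executors.newFixedThreadPool(1);
        try {
            // The outer "mutation" blocks on a future that can only be
            // completed by another task queued on the same pool -- mirroring
            // a base-table write waiting on an MV lock held by a queued task.
            Future<String> outer = mutationStage.submit(() -> {
                Future<String> inner = mutationStage.submit(() -> "mv-lock-released");
                return inner.get(); // the single worker is busy right here
            });
            outer.get(500, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (TimeoutException e) {
            return "deadlocked: worker blocked waiting on its own pool";
        } finally {
            mutationStage.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

With one worker the hang is guaranteed, which is what makes the dtest deterministic; with more workers the inner task may land on a free thread, so the same code only deadlocks when all workers happen to be blocked at once.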

From my point of view this is solid proof of my theory. Please apply my 
patch. I have already deployed it on our production system and it seems to 
work there as well - at least there have been no more deadlocks under high load.

Ah, btw: That bug is as old as MVs are.

If there are questions, please get back to me.

> All MutationStage threads blocked, kills server
> -----------------------------------------------
>
>                 Key: CASSANDRA-12689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Priority: Critical
>
> Under heavy load (e.g. due to repair during normal operations), a lot of 
> NullPointerExceptions occur in MutationStage. Unfortunately, the log is not 
> very chatty; the stack trace is missing:
> 2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught 
> exception on thread Thread[MutationStage-1,5,main]: {}
> 2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
> Then, after some time, in most cases ALL threads in the MutationStage pools 
> are completely blocked. This leads to pending tasks piling up until the 
> server runs OOM and becomes completely unresponsive due to GC. The threads 
> NEVER unblock until the server is restarted - even if load goes completely 
> down, all hints are paused, and no compaction or repair is running. Only a 
> restart helps.
> I can understand that pending tasks in MutationStage may pile up under heavy 
> load, but tasks should be processed and dequeued after load goes down. This 
> is definitely not the case. This looks more like an unhandled exception 
> leading to a stuck lock.
> Stack trace from jconsole; all threads in MutationStage show the same trace.
> Name: MutationStage-48
> State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
> Total blocked: 137  Total waited: 138.513
> Stack trace: 
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
> org.apache.cassandra.hints.Hint.apply(Hint.java:96)
> org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
> java.lang.Thread.run(Thread.java:745)
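The trace above ends in Guava's `Uninterruptibles.getUninterruptibly`, which also explains why the threads never unblock even when load drops: that helper swallows interrupts and retries the wait, so the only exit is the future actually completing. Below is a JDK-only stand-in for that retry loop (my sketch, not Guava's source); the `mvLock` future and `demo` method are hypothetical names for illustration.

```java
import java.util.concurrent.*;

public class UninterruptibleWait {

    // JDK-only stand-in for Guava's Uninterruptibles.getUninterruptibly:
    // InterruptedException is swallowed and the wait is retried, so the only
    // way out is the future completing. If the lock future never completes --
    // as in this deadlock -- the thread is stuck until restart.
    static <T> T getUninterruptibly(Future<T> future) throws ExecutionException {
        boolean interrupted = false;
        try {
            while (true) {
                try {
                    return future.get();
                } catch (InterruptedException e) {
                    interrupted = true; // swallow the interrupt, keep waiting
                }
            }
        } finally {
            if (interrupted) Thread.currentThread().interrupt(); // restore flag
        }
    }

    static String demo() throws Exception {
        CompletableFuture<String> mvLock = new CompletableFuture<>();
        Thread waiter = Thread.currentThread();
        new Thread(() -> {
            waiter.interrupt();          // has no visible effect on the wait
            mvLock.complete("lock released");
        }).start();
        return getUninterruptibly(mvLock);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Interrupting the waiter changes nothing observable; the call returns only once the future completes, which matches the observation that the blocked MutationStage threads survive any load drop and recover only on restart.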



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
