[ https://issues.apache.org/jira/browse/CASSANDRA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519287#comment-15519287 ]

Benjamin Roth commented on CASSANDRA-12689:
-------------------------------------------

I guess I found the cause of the problem (at least I'm pretty sure).
There is a race condition when Keyspace.apply is called in a blocking manner from 
Mutation.apply. See Mutation.java line 227: 
Uninterruptibles.getUninterruptibly(applyFuture(durableWrites))

When this is called AND the lock for the MV update cannot be acquired, THEN the 
apply is deferred onto the MutationStage queue and Mutation.apply waits for that 
deferred task to finish, right? So this thread (a MutationStage worker) is 
blocked until the deferred future completes and cannot process any other tasks 
in the mutation queue.
But what if all mutation workers are currently busy and in the same situation? 
Then the deferred tasks will never be processed, the futures will never 
complete, and all workers are waiting for futures that will never complete 
=> complete DEADLOCK.

More abstract: a blocking call in any stage MUST NEVER defer itself onto its own 
stage.
Simple example: imagine a queue with 1 worker. That worker is processing a task 
from this queue. The task enqueues another task on the same queue and waits for 
it to finish. It never will, as there is only one worker and it is now blocked 
(see the sketch below).
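
To make that single-worker example concrete, here is a minimal, self-contained 
Java sketch (not Cassandra code, just an illustration of the pattern):

    import java.util.concurrent.*;

    public class SelfDeferDeadlock {
        public static void main(String[] args) throws Exception {
            // One worker thread, like a MutationStage with a single worker.
            ExecutorService stage = Executors.newSingleThreadExecutor();

            Future<?> outer = stage.submit(() -> {
                // The running task defers more work onto its own stage ...
                Future<?> deferred = stage.submit(() -> System.out.println("deferred apply"));
                try {
                    // ... and blocks waiting for it. The only worker is busy right
                    // here, so the deferred task can never run => self-deadlock.
                    deferred.get();
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            });

            // Never returns; the process hangs until it is killed.
            outer.get();
        }
    }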

Possible solutions:
1. Complete the future before deferring. Would resolve this particular issue, 
but that would mean "fire and forget", and that is not what futures are made 
for.
2. Do not block in Mutation.apply, or use Mutation.applyFuture in critical 
situations - probably a fine solution, but harder to implement and with a big 
impact on existing code.
3. My personally preferred option: introduce a "deferrable" flag in 
Keyspace.apply and set it to false when called from a blocking context. If it 
is false, don't defer the current apply but retry in a loop until success or 
until the write timeout is reached, maybe with a small sleep depending on 
writeTimeout (e.g. writeTimeout / 100). A sketch follows below.
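
For option 3, a rough sketch of the retry loop I have in mind. tryApplyLocked(), 
deferOnMutationStage() and the timeout constant are just placeholders here, not 
the real Keyspace/Mutation API:

    import java.util.concurrent.TimeUnit;

    public final class BlockingApplySketch {
        static final long WRITE_TIMEOUT_MS = 2000;                  // e.g. write_request_timeout_in_ms
        static final long RETRY_SLEEP_MS = WRITE_TIMEOUT_MS / 100;  // small sleep as suggested above

        // deferrable == false when called from a blocking context like Mutation.apply()
        static void apply(boolean deferrable) throws InterruptedException {
            if (deferrable) {
                // Caller does not block, so deferring onto the stage is still fine.
                deferOnMutationStage();
                return;
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(WRITE_TIMEOUT_MS);
            // Never defer onto our own stage; retry until the MV lock is free or we time out.
            while (!tryApplyLocked()) {
                if (System.nanoTime() >= deadline)
                    throw new RuntimeException("write timed out waiting for MV lock");
                Thread.sleep(RETRY_SLEEP_MS);
            }
        }

        static boolean tryApplyLocked() { return true; }    // placeholder: false if the MV lock is busy
        static void deferOnMutationStage() {}                // placeholder
    }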

Apart from all that:
If a caller is waiting (blocking) for a future to finish, it makes no sense at 
all to defer the work to another thread just so the current thread is not 
blocked (see the comment on line 492 in Keyspace). The caller thread is blocked 
anyway while waiting for the future to complete.

Does anyone agree?

> All MutationStage threads blocked, kills server
> -----------------------------------------------
>
>                 Key: CASSANDRA-12689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Priority: Critical
>
> Under heavy load (e.g. due to repair during normal operations), a lot of 
> NullPointerExceptions occur in MutationStage. Unfortunately, the log is not 
> very chatty, trace is missing:
> 2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught 
> exception on thread Thread[MutationStage-1,5,main]: {}
> 2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
> Then, after some time, in most cases ALL threads in the MutationStage pool are 
> completely blocked. This leads to pending tasks piling up until the server runs 
> OOM and becomes completely unresponsive due to GC. The threads NEVER unblock 
> until a server restart, even if load goes completely down, all hints are 
> paused, and no compaction or repair is running. Only a restart helps.
> I can understand that pending tasks in MutationStage may pile up under heavy 
> load, but tasks should be processed and dequeued after load goes down. This is 
> definitely not the case. This looks more like an unhandled exception leading to 
> a stuck lock.
> Stack trace from jconsole, all Threads in MutationStage show same trace.
> Name: MutationStage-48
> State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
> Total blocked: 137  Total waited: 138.513
> Stack trace: 
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
> org.apache.cassandra.hints.Hint.apply(Hint.java:96)
> org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
> java.lang.Thread.run(Thread.java:745)


