[ https://issues.apache.org/jira/browse/CASSANDRA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Hobbs updated CASSANDRA-12689:
------------------------------------
Description:
Under heavy load (e.g. due to repair during normal operations), a lot of
NullPointerExceptions occur in MutationStage. Unfortunately, the log is not
very chatty and the stack trace is missing:
{noformat}
2016-09-22T06:29:47+00:00 cas6 [MutationStage-1]
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught
exception on thread Thread[MutationStage-1,5,main]: {}
2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
{noformat}
Then, after some time, in most cases ALL threads in MutationStage pools are
completely blocked. Pending tasks pile up until the server runs out of memory
(OOM) and becomes completely unresponsive due to GC. The threads NEVER unblock
until the server is restarted, even if load goes completely down, all hints
are paused, and no compaction or repair is running. Only a restart helps.
I can understand that pending tasks in MutationStage may pile up under heavy
load, but tasks should be processed and dequeued after load goes down. This is
definitely not the case. This looks more like an unhandled exception
leading to a stuck lock.
Stack trace from jconsole; all threads in MutationStage show the same trace.
{noformat}
Name: MutationStage-48
State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
Total blocked: 137 Total waited: 138.513
{noformat}
Stack trace:
{noformat}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
org.apache.cassandra.hints.Hint.apply(Hint.java:96)
org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
java.lang.Thread.run(Thread.java:745)
{noformat}
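To illustrate the suspicion above: the trace shows every thread parked in CompletableFuture.get() with no timeout. The following is a minimal sketch, not Cassandra code, of one way such a wait can become permanent: if the code that is supposed to complete the future throws first, the waiter is never woken. The class StuckFutureSketch and the methods applyUnsafe, applySafe, startDaemon and finishesWithin are hypothetical names chosen for this example, and whether this is what actually happens in Mutation.apply is an assumption, not a finding.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckFutureSketch {

    // Runs `write` on a background thread. If `write` throws, the future is
    // never completed, so a caller blocked in get() without a timeout would
    // wait forever.
    static CompletableFuture<Void> applyUnsafe(Runnable write) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        startDaemon(() -> {
            write.run();          // an exception here skips the next line
            done.complete(null);
        });
        return done;
    }

    // Same work, but a failure is propagated to whoever waits on the future.
    static CompletableFuture<Void> applySafe(Runnable write) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        startDaemon(() -> {
            try {
                write.run();
                done.complete(null);
            } catch (Throwable t) {
                done.completeExceptionally(t);
            }
        });
        return done;
    }

    // Starts a daemon thread that reports uncaught exceptions on stderr.
    static void startDaemon(Runnable task) {
        Thread t = new Thread(task, "sketch-worker");
        t.setDaemon(true);
        t.setUncaughtExceptionHandler((thread, e) ->
            System.err.println("Uncaught exception on " + thread.getName() + ": " + e));
        t.start();
    }

    // True if the future finishes (normally or exceptionally) within the timeout.
    static boolean finishesWithin(CompletableFuture<?> future, long millis)
            throws InterruptedException {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException e) {
            return true;   // completed exceptionally: the waiter was released
        } catch (TimeoutException e) {
            return false;  // still pending: an untimed get() would stay parked
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable failingWrite = () -> { throw new NullPointerException(); };
        System.out.println("unsafe finishes: " + finishesWithin(applyUnsafe(failingWrite), 500));
        System.out.println("safe finishes:   " + finishesWithin(applySafe(failingWrite), 500));
    }
}
```

The sketch uses a timed get() only so that it terminates; in the unsafe variant the future stays pending after the NullPointerException, which is the state an untimed get() would block on indefinitely.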
was:
Under heavy load (e.g. due to repair during normal operations), a lot of
NullPointerExceptions occur in MutationStage. Unfortunately, the log is not
very chatty, trace is missing:
2016-09-22T06:29:47+00:00 cas6 [MutationStage-1]
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught
exception on thread Thread[MutationStage-1,5,main]: {}
2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
Then, after some time, in most cases ALL threads in MutationStage pools are
completely blocked. This leads to piling up pending tasks until server runs OOM
and is completely unresponsive due to GC. Threads will NEVER unblock until
server restart. Even if load goes completely down, all hints are paused, and no
compaction or repair is running. Only restart helps.
I can understand that pending tasks in MutationStage may pile up under heavy
load, but tasks should be processed and dequeued after load goes down. This is
definitely not the case. This looks more like an unhandled exception
leading to a stuck lock.
Stack trace from jconsole, all Threads in MutationStage show same trace.
Name: MutationStage-48
State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
Total blocked: 137 Total waited: 138.513
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
org.apache.cassandra.hints.Hint.apply(Hint.java:96)
org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
java.lang.Thread.run(Thread.java:745)
> All MutationStage threads blocked, kills server
> -----------------------------------------------
>
> Key: CASSANDRA-12689
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
> Project: Cassandra
> Issue Type: Bug
> Components: Local Write-Read Paths
> Reporter: Benjamin Roth
> Priority: Critical
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)