Re: Uncaught exception on thread CounterMutationStage

David Salz Thu, 27 Jul 2017 07:42:10 -0700

Hi Jeff,

thanks for the pointers!


We have now upgraded to C* 3.11.0 and the situation has improved a
little: the node no longer dies completely, but the
WriteTimeoutExceptions persist and still 'freeze' the node for a couple
of minutes.


> A single node with 20 cores and 256GB of RAM is probably not going to
> be the best choice - while it's a great machine, the default cassandra
> config really isn't tuned for that # of cores or that much RAM (it'll
> almost all be left for page cache, which is great for reads, and less
> great for write heavy workloads). What sort of heap settings are you
> using? 

-ea
-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-XX:+HeapDumpOnOutOfMemoryError
-Xss256k
-XX:StringTableSize=1000003
-XX:+AlwaysPreTouch
-XX:-UseBiasedLocking
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:+UseNUMA
-XX:+PerfDisableSharedMem
-Djava.net.preferIPv4Stack=true
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=700
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
-Xms98304M
-Xmx98304M

GC does not seem to be the issue: I am seeing GC runs about every 30
seconds and they usually finish well below the 700 ms limit. I will
enable the GC log file though; I don't have that right now.
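
Once GC logging is on, the "well below 700 ms" observation could be checked against the log itself. The following is a minimal sketch, assuming Java 8 style log lines as produced by the -XX:+PrintGCApplicationStoppedTime flag listed above; the function name long_pauses and the sample lines are hypothetical and not taken from this node.

```python
import re

# Matches the line written by -XX:+PrintGCApplicationStoppedTime (Java 8 format).
STOPPED = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9.]+) seconds"
)


def long_pauses(lines, threshold_ms=700.0):
    """Return all stop-the-world pauses (in ms) longer than threshold_ms."""
    pauses = []
    for line in lines:
        match = STOPPED.search(line)
        if match:
            pause_ms = float(match.group(1)) * 1000.0
            if pause_ms > threshold_ms:
                pauses.append(pause_ms)
    return pauses


# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    "2017-07-27T07:40:00.100+0000: 100.000: Total time for which application "
    "threads were stopped: 0.1200000 seconds, Stopping threads took: 0.0001000 seconds",
    "2017-07-27T07:40:30.100+0000: 130.000: Total time for which application "
    "threads were stopped: 1.5000000 seconds, Stopping threads took: 0.0002000 seconds",
]

if __name__ == "__main__":
    for pause in long_pauses(SAMPLE_LOG):
        print(f"pause above limit: {pause:.0f} ms")
```

For a real log, the lines of the GC log file could be passed in place of SAMPLE_LOG. Note that this counts every safepoint pause, not only GC pauses, which is usually what matters when hunting for a 'frozen' node.
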

> You're getting timeouts on a single node cluster, which usually means you're 
> in a GC spin a thread deadlocked or a thread pool backed up or similar. 
> Seeing 'nodetool tpstats' may be a starting point. Knowing whether the node 
> stops processing all data at this time, or just some of it, would also help. 
> You'd want to take a look for indications of a GC pause (GCInspector log 
> lines, or even better actual GC logs), and if that doesn't work, jstack 
> output thrown onto pastebin or gist or similar.
>
Good point. I checked tpstats and found a high number (millions) of
all-time blocked Native-Transport-Requests. I googled a bit and have now set

-Dcassandra.max_queued_native_transport_requests=4096

and, in cassandra.yaml,

native_transport_max_threads: 4096

Seeing no more blocked NTRs so far. Do you think this could have
contributed to the problem? The default values seemed way too small for
our load and our machine at any rate.
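
To keep an eye on whether blocked requests come back, the tpstats check can be scripted. This is a rough sketch, assuming the tabular pool layout that 'nodetool tpstats' prints in 3.x (a pool name followed by five numeric columns); the function name blocked_pools is hypothetical and the sample numbers are made up, not the actual output from this node.

```python
def blocked_pools(tpstats_text):
    """Map pool name -> 'All time blocked' count, for pools where it is > 0."""
    result = {}
    for line in tpstats_text.splitlines():
        fields = line.split()
        # Pool rows are a name plus five integer columns:
        # Active, Pending, Completed, Blocked, All time blocked.
        # Headers and the dropped-message section do not match and are skipped.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        all_time_blocked = int(fields[5])
        if all_time_blocked > 0:
            result[fields[0]] = all_time_blocked
    return result


# Illustrative sample only; the values are invented.
SAMPLE = """\
Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              0         0          10234         0                 0
CounterMutationStage                   2        15         987654         0                 0
Native-Transport-Requests              8         3        5678901         1           2345678

Message type           Dropped
READ                         0
COUNTER_MUTATION            42
"""

if __name__ == "__main__":
    for pool, count in blocked_pools(SAMPLE).items():
        print(f"{pool}: {count} all-time blocked")
```

Since the counter is cumulative since node start, comparing two runs taken a few minutes apart shows whether new requests are still being blocked.
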
Again, thanks for the help so far!

David




-- 
-----------------------------------
Technical Director / Co-Founder
Sandbox Interactive GmbH
http://albiononline.com


