Hi Jeff, thanks for the pointers!

We have now upgraded to C* 3.11.0 and the situation has improved a little:
the node no longer dies completely, but the WriteTimeoutExceptions persist
and still 'freeze' the node for a couple of minutes.

> A single node with 20 cores and 256GB of RAM is probably not going to
> be the best choice - while it's a great machine, the default cassandra
> config really isn't tuned for that # of cores or that much RAM (it'll
> almost all be left for page cache, which is great for reads, and less
> great for write heavy workloads). What sort of heap settings are you
> using?

-ea
-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-XX:+HeapDumpOnOutOfMemoryError
-Xss256k
-XX:StringTableSize=1000003
-XX:+AlwaysPreTouch
-XX:-UseBiasedLocking
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:+UseNUMA
-XX:+PerfDisableSharedMem
-Djava.net.preferIPv4Stack=true
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=700
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
-Xms98304M
-Xmx98304M

GC does not seem to be the issue: we see GC runs every 30 seconds, and they
usually finish well below the 700ms limit. I will enable the GC log file
though; we don't have that right now.

> You're getting timeouts on a single node cluster, which usually means you're
> in a GC spin a thread deadlocked or a thread pool backed up or similar.
> Seeing 'nodetool tpstats' may be a starting point. Knowing whether the node
> stops processing all data at this time, or just some of it, would also help.
> You'd want to take a look for indications of a GC pause (GCInspector log
> lines, or even better actual GC logs), and if that doesn't work, jstack
> output thrown onto pastebin or gist or similar.
>

Good point. I checked tpstats and found a high number (millions) of all-time
blocked Native-Transport-Requests.
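The blocked counts are easy to miss in the raw 'nodetool tpstats' table, so a small script along these lines could pick them out. This is only a sketch: it assumes the six-column pool table (pool name, Active, Pending, Completed, Blocked, All time blocked) that 3.x prints, and the names `PoolStats`, `parse_tpstats` and `blocked_pools` are made up for illustration.

```python
from typing import Dict, NamedTuple


class PoolStats(NamedTuple):
    """Counters for one thread pool row of the tpstats pool table."""
    active: int
    pending: int
    completed: int
    blocked: int
    all_time_blocked: int


def parse_tpstats(text: str) -> Dict[str, PoolStats]:
    """Parse the thread pool table from captured 'nodetool tpstats' output.

    Assumption: every pool row is a name without spaces followed by
    exactly five non-negative integers. Header lines and the dropped
    message table do not match that shape and are skipped.
    """
    pools: Dict[str, PoolStats] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue
        name, *numbers = parts
        if not all(n.isdigit() for n in numbers):
            continue
        pools[name] = PoolStats(*(int(n) for n in numbers))
    return pools


def blocked_pools(pools: Dict[str, PoolStats]) -> Dict[str, int]:
    """Return only the pools whose all-time blocked counter is non-zero."""
    return {
        name: stats.all_time_blocked
        for name, stats in pools.items()
        if stats.all_time_blocked > 0
    }
```

Feeding it the captured text, e.g. `blocked_pools(parse_tpstats(output))`, returns a dict of pool name to all-time blocked count. Running it before and after a config change shows whether the counter is still growing.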
I googled a bit and have now set
-Dcassandra.max_queued_native_transport_requests=4096 and
native_transport_max_threads=4096. Seeing no more blocked NTRs so far.

Do you think this could have contributed to the problem? The default values
seemed way too small for our load and our machine at any rate.

Again, thanks for the help so far!

David

--
-----------------------------------
Technical Director / Co-Founder
Sandbox Interactive GmbH
http://albiononline.com