Counter errors - RC1

Joe Obernberger Mon, 10 May 2021 14:44:01 -0700

Hi all - I'm getting the following error on RC1:

WARN [Messaging-EventLoop-3-23] 2021-05-10 17:29:12,431 NoSpamLogger.java:95 - /172.16.100.39:7000->/172.16.100.248:7000-URGENT_MESSAGES-e8d21588 dropping message of type FAILURE_RSP whose timeout expired before reaching the network
ERROR [CounterMutationStage-62] 2021-05-10 17:29:12,431 AbstractLocalAwareExecutorService.java:166 - Uncaught exception on thread Thread[CounterMutationStage-62,5,main]
java.lang.RuntimeException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2278)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.db.CounterMutation.grabCounterLocks(CounterMutation.java:162)
        at org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:131)
        at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:1678)
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2274)
        ... 6 common frames omitted


This happens under load.

I'm also seeing a lot of these messages:

WARN [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:319 - Not marking nodes down due to local pause of 5785753812ns > 5000000000ns
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
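
For scale, the local pause in the FailureDetector warning can be converted from nanoseconds to seconds. This is only a small sketch in Python: the two numbers are copied from the log line above, and the variable names are illustrative.

```python
# Values copied from the FailureDetector warning above (nanoseconds).
pause_ns = 5_785_753_812
threshold_ns = 5_000_000_000

NS_PER_SECOND = 1_000_000_000
NS_PER_MILLISECOND = 1_000_000

pause_s = pause_ns / NS_PER_SECOND
threshold_s = threshold_ns / NS_PER_SECOND
over_by_ms = (pause_ns - threshold_ns) / NS_PER_MILLISECOND

print(f"pause: {pause_s:.3f} s, threshold: {threshold_s:.3f} s, "
      f"over by {over_by_ms:.0f} ms")
```

So the node itself stalled for roughly 5.8 seconds. Pauses of this length are often caused by garbage collection, though that is an assumption here and not something the log line states.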


The other messages are slow queries like:

<SELECT mediatype, origvalue FROM doc.origdoc WHERE uuid = DS_5_2021-05-08T06-53-41.442Z_Hi0ywdNE LIMIT 1>, time 1370 msec - slow timeout 500 msec

I've tried switching to the G1 garbage collector (Java 11), and that did reduce these times (I was seeing over 5000 msec). The above SELECT statement is on a table where uuid is the primary key.


Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving

--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.208  9.16 GiB   30      9.3%              2529b6ed-cdb2-43c2-bdd7-171cfe308bd3  rack1
UN  172.16.100.249  60.69 GiB  200     62.9%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   61.16 GiB  200     62.9%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   61.07 GiB  200     63.0%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  1.24 GiB   4       1.3%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  60.35 GiB  200     62.9%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.37   37.18 GiB  120     37.7%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
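
The token counts and effective ownership in the status output above can be cross-checked with a few lines of Python. This is a rough sketch: the figures are copied from that output, and the variable names are illustrative.

```python
# (address, tokens, effective ownership in percent),
# copied from the nodetool status output above.
nodes = [
    ("172.16.100.208", 30, 9.3),
    ("172.16.100.249", 200, 62.9),
    ("172.16.100.36", 200, 62.9),
    ("172.16.100.39", 200, 63.0),
    ("172.16.100.253", 4, 1.3),
    ("172.16.100.248", 200, 62.9),
    ("172.16.100.37", 120, 37.7),
]

total_tokens = sum(tokens for _, tokens, _ in nodes)
total_owned = sum(owns for _, _, owns in nodes)

print(f"total tokens: {total_tokens}, "
      f"total effective ownership: {total_owned:.1f}%")

# Share of the ring's tokens held by each node, next to what nodetool reports.
for address, tokens, owns in nodes:
    token_share = 100.0 * tokens / total_tokens
    print(f"{address:<15} tokens={tokens:>3} "
          f"token share={token_share:5.1f}% owns={owns:5.1f}%")
```

The effective ownership adds up to about 300%, which would be consistent with a replication factor of 3 (an inference; the keyspace definition is not shown here). The four 200-token nodes each hold about 21% of the tokens, while the 4-token node holds well under 1%, so data and load are spread unevenly across the seven nodes.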


nodetool tablestats doc.origdoc
Total number of tables: 74
----------------
Keyspace : doc
        Read Count: 37511
        Read Latency: 33.929465116899046 ms
        Write Count: 4604965
        Write Latency: 0.20405303102195133 ms
        Pending Flushes: 0
                Table: origdoc
                SSTable count: 85
                Old SSTable count: 0
                Space used (live): 54635707180
                Space used (total): 54635707180
                Space used by snapshots (total): 0
                Off heap memory used (total): 258773554
                SSTable Compression Ratio: 0.33099344385825985
                Number of partitions (estimate): 114982637
                Memtable cell count: 0
                Memtable data size: 0
                Memtable off heap memory used: 0
                Memtable switch count: 0
                Local read count: 5749
                Local read latency: 240.422 ms
                Local write count: 0
                Local write latency: NaN ms
                Pending flushes: 0
                Percent repaired: 0.01
                Bloom filter false positives: 16
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 141861208
                Bloom filter off heap memory used: 141860528
                Index summary off heap memory used: 44391250
                Compression metadata off heap memory used: 72521776
                Compacted partition minimum bytes: 259
                Compacted partition maximum bytes: 4768
                Compacted partition mean bytes: 1366
                Average live cells per slice (last five minutes): 1.0
                Maximum live cells per slice (last five minutes): 1
                Average tombstones per slice (last five minutes): 1.0
                Maximum tombstones per slice (last five minutes): 1
                Dropped Mutations: 0
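
To make the raw byte counts in the tablestats output above easier to read, here is a small Python sketch that converts a few of them into friendlier units. The values are copied from that output; the variable names and the choice of derived figures are only illustrative.

```python
# Values copied from the nodetool tablestats output above.
space_used_live_bytes = 54_635_707_180
sstable_count = 85
partitions_estimate = 114_982_637
bloom_filter_bytes = 141_861_208
off_heap_total_bytes = 258_773_554

MIB = 1024 ** 2
GIB = 1024 ** 3

live_gib = space_used_live_bytes / GIB
avg_sstable_mib = space_used_live_bytes / sstable_count / MIB
bloom_bits_per_partition = bloom_filter_bytes * 8 / partitions_estimate
off_heap_mib = off_heap_total_bytes / MIB

print(f"live data:             {live_gib:.2f} GiB")
print(f"average SSTable size:  {avg_sstable_mib:.0f} MiB")
print(f"bloom filter:          {bloom_bits_per_partition:.2f} bits per partition")
print(f"off-heap memory:       {off_heap_mib:.0f} MiB")
```

The table holds a bit under 51 GiB on this node, spread over 85 SSTables. How many of those a single-partition read has to touch depends on the compaction strategy, which is not shown in this output.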

Things to check?  Things to try?

Thanks!

-Joe
