Hi all - I'm getting the following error on RC1:

WARN  [Messaging-EventLoop-3-23] 2021-05-10 17:29:12,431 NoSpamLogger.java:95 - /172.16.100.39:7000->/172.16.100.248:7000-URGENT_MESSAGES-e8d21588 dropping message of type FAILURE_RSP whose timeout expired before reaching the network
ERROR [CounterMutationStage-62] 2021-05-10 17:29:12,431 AbstractLocalAwareExecutorService.java:166 - Uncaught exception on thread Thread[CounterMutationStage-62,5,main]
java.lang.RuntimeException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2278)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.db.CounterMutation.grabCounterLocks(CounterMutation.java:162)
        at org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:131)
        at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:1678)
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2274)
        ... 6 common frames omitted

This happens under load.

I'm also seeing a lot of these messages:

WARN  [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:319 - Not marking nodes down due to local pause of 5785753812ns > 5000000000ns
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
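
For scale, the nanosecond values in the WARN line can be converted to seconds. This is only a minimal sketch: the two numbers are copied from the log line above, and the variable names are illustrative.

```python
# Values copied from the FailureDetector WARN line above.
local_pause_ns = 5_785_753_812  # observed local pause
threshold_ns = 5_000_000_000    # threshold reported in the same line

NS_PER_SECOND = 1_000_000_000
NS_PER_MILLISECOND = 1_000_000

pause_s = local_pause_ns / NS_PER_SECOND
threshold_s = threshold_ns / NS_PER_SECOND
# How far the pause exceeded the threshold, in milliseconds.
overshoot_ms = (local_pause_ns - threshold_ns) / NS_PER_MILLISECOND

print(f"pause: {pause_s:.2f} s, threshold: {threshold_s:.2f} s, "
      f"over by {overshoot_ms:.0f} ms")
```

So the local pause was roughly 5.8 seconds against a 5 second threshold.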

The other messages are slow queries like:
SELECT mediatype, origvalue FROM doc.origdoc WHERE uuid = DS_5_2021-05-08T06-53-41.442Z_Hi0ywdNE LIMIT 1>, time 1370 msec - slow timeout 500 msec

I've tried switching to the G1 garbage collector (Java 11), and that did reduce these times (I was seeing over 5000 msec). The SELECT statement above is on a table where uuid is the primary key.

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.208  9.16 GiB   30      9.3%             2529b6ed-cdb2-43c2-bdd7-171cfe308bd3  rack1
UN  172.16.100.249  60.69 GiB  200     62.9%            49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   61.16 GiB  200     62.9%            d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   61.07 GiB  200     63.0%            93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  1.24 GiB   4       1.3%             a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  60.35 GiB  200     62.9%            4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.37   37.18 GiB  120     37.7%            08a19658-40be-4e55-8709-812b3d4ac750  rack1
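
The token counts vary widely between nodes (from 4 to 200), so it may help to see each node's share of the total token count next to its reported ownership. A minimal sketch, with the data copied from the nodetool status output above; the helper name `token_share` is illustrative, and token share is only a rough proxy, since actual ownership depends on token placement and the keyspace's replication settings.

```python
# (address, token count, reported "Owns (effective)" %) copied from
# the nodetool status output above.
nodes = [
    ("172.16.100.208", 30, 9.3),
    ("172.16.100.249", 200, 62.9),
    ("172.16.100.36", 200, 62.9),
    ("172.16.100.39", 200, 63.0),
    ("172.16.100.253", 4, 1.3),
    ("172.16.100.248", 200, 62.9),
    ("172.16.100.37", 120, 37.7),
]

total_tokens = sum(tokens for _, tokens, _ in nodes)


def token_share(tokens: int) -> float:
    """Return this node's percentage of all tokens in the ring."""
    return 100.0 * tokens / total_tokens


for address, tokens, owns in nodes:
    print(f"{address:<15} tokens={tokens:>3} "
          f"token share={token_share(tokens):5.1f}%  owns={owns:5.1f}%")
```

The four 200-token nodes together hold most of the ring, which matches the load column, where they each carry about 60 GiB.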

nodetool tablestats doc.origdoc
Total number of tables: 74
----------------
Keyspace : doc
        Read Count: 37511
        Read Latency: 33.929465116899046 ms
        Write Count: 4604965
        Write Latency: 0.20405303102195133 ms
        Pending Flushes: 0
                Table: origdoc
                SSTable count: 85
                Old SSTable count: 0
                Space used (live): 54635707180
                Space used (total): 54635707180
                Space used by snapshots (total): 0
                Off heap memory used (total): 258773554
                SSTable Compression Ratio: 0.33099344385825985
                Number of partitions (estimate): 114982637
                Memtable cell count: 0
                Memtable data size: 0
                Memtable off heap memory used: 0
                Memtable switch count: 0
                Local read count: 5749
                Local read latency: 240.422 ms
                Local write count: 0
                Local write latency: NaN ms
                Pending flushes: 0
                Percent repaired: 0.01
                Bloom filter false positives: 16
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 141861208
                Bloom filter off heap memory used: 141860528
                Index summary off heap memory used: 44391250
                Compression metadata off heap memory used: 72521776
                Compacted partition minimum bytes: 259
                Compacted partition maximum bytes: 4768
                Compacted partition mean bytes: 1366
                Average live cells per slice (last five minutes): 1.0
                Maximum live cells per slice (last five minutes): 1
                Average tombstones per slice (last five minutes): 1.0
                Maximum tombstones per slice (last five minutes): 1
                Dropped Mutations: 0
Things to check?  Things to try?

Thanks!

-Joe
