One of the nodes was swapping in this case; fixed that - problem solved.
Yes - the machines are varying sizes and I wanted to test to see how
well a cluster would work in such a configuration.
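For anyone who hits the same symptoms: the usual recommendation for Cassandra hosts is to rule swap out entirely (run `swapoff -a` for the running system and remove swap entries from /etc/fstab). A sketch of the persistent sysctl side - the file name and value here are common defaults, not something specific to my cluster:

```
# /etc/sysctl.d/99-cassandra.conf (hypothetical file name)
# Keep the kernel from swapping out the Cassandra JVM; long swap-in
# stalls show up as GC-like pauses and failure-detector warnings.
vm.swappiness = 0
```

Apply with `sysctl --system` (or reboot) after editing.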
-Joe
On 5/10/2021 8:14 PM, Kane Wilson wrote:
It seems like some of your nodes are overloaded, potentially at least
#RF of them. Is it intentional that some of your nodes have varying
numbers of tokens? If nodes are heavily overloaded, GC tuning generally
won't help much; you're best off starting by reducing load or
increasing capacity.
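If you want a quick way to spot the token imbalance across a cluster, comparing per-node token counts against the cluster's most common value makes it obvious. A rough sketch - the field positions assume the standard plain-text `nodetool status` layout, and `uneven_token_nodes` is just an illustrative helper name:

```python
# Sketch: flag nodes whose token count differs from the cluster's most
# common value, given the plain-text output of `nodetool status`.
from collections import Counter

def uneven_token_nodes(status_output: str):
    nodes = {}
    for line in status_output.splitlines():
        parts = line.split()
        # Data rows start with a status/state code like "UN" ("Up/Normal").
        # Columns: status, address, load ("9.16 GiB" = 2 fields), tokens, ...
        if len(parts) >= 7 and parts[0] in ("UN", "DN", "UL", "DL", "UJ", "UM"):
            address, tokens = parts[1], int(parts[4])
            nodes[address] = tokens
    if not nodes:
        return []
    common = Counter(nodes.values()).most_common(1)[0][0]
    return sorted(addr for addr, t in nodes.items() if t != common)
```

Feed it the raw `nodetool status` text and it returns the addresses carrying an unusual token count - in a cluster like the one below, the 30-, 4-, and 120-token nodes.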
raft.so <https://raft.so> - Cassandra consulting, support, and
managed services
On Tue, May 11, 2021 at 7:44 AM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Hi all - I'm getting the following error on RC1:
WARN [Messaging-EventLoop-3-23] 2021-05-10 17:29:12,431 NoSpamLogger.java:95 - /172.16.100.39:7000->/172.16.100.248:7000-URGENT_MESSAGES-e8d21588 dropping message of type FAILURE_RSP whose timeout expired before reaching the network
ERROR [CounterMutationStage-62] 2021-05-10 17:29:12,431 AbstractLocalAwareExecutorService.java:166 - Uncaught exception on thread Thread[CounterMutationStage-62,5,main]
java.lang.RuntimeException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2278)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.db.CounterMutation.grabCounterLocks(CounterMutation.java:162)
        at org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:131)
        at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:1678)
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2274)
        ... 6 common frames omitted
This happens under load.
I'm also seeing a lot of these messages:
WARN [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:319 - Not marking nodes down due to local pause of 5785753812ns > 5000000000ns
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
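Those pause figures are in nanoseconds, so the warning is saying the process stalled for nearly 5.8 s against the detector's 5 s threshold - long enough to look like heavy GC or swapping:

```python
# Convert the failure detector's nanosecond figures to seconds.
pause_ns = 5_785_753_812      # the local pause from the WARN line
threshold_ns = 5_000_000_000  # the detector's threshold from the same line

print(pause_ns / 1e9)      # observed pause in seconds (~5.79)
print(threshold_ns / 1e9)  # threshold in seconds (5.0)
```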
The other messages are slow queries like:
SELECT mediatype, origvalue FROM doc.origdoc WHERE uuid = DS_5_2021-05-08T06-53-41.442Z_Hi0ywdNE LIMIT 1>, time 1370 msec - slow timeout 500 msec
I've tried switching to the G1 garbage collector (Java 11), and that did
reduce these times (I was seeing over 5000 msec). The above SELECT
statement is on a table where uuid is the primary key.
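For reference, by "switching to G1" I mean uncommenting the G1 section in conf/jvm11-server.options and disabling the CMS settings. A sketch - these flag names match what ships with the 4.0 options file, but the pause target is something to tune rather than a recommendation:

```
# conf/jvm11-server.options - G1 section (ships commented out)
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
# ...and comment out the CMS flags in the same file.
```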
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.208  9.16 GiB   30      9.3%              2529b6ed-cdb2-43c2-bdd7-171cfe308bd3  rack1
UN  172.16.100.249  60.69 GiB  200     62.9%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   61.16 GiB  200     62.9%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   61.07 GiB  200     63.0%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  1.24 GiB   4       1.3%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  60.35 GiB  200     62.9%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.37   37.18 GiB  120     37.7%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
nodetool tablestats doc.origdoc
Total number of tables: 74
----------------
Keyspace : doc
        Read Count: 37511
        Read Latency: 33.929465116899046 ms
        Write Count: 4604965
        Write Latency: 0.20405303102195133 ms
        Pending Flushes: 0
                Table: origdoc
                SSTable count: 85
                Old SSTable count: 0
                Space used (live): 54635707180
                Space used (total): 54635707180
                Space used by snapshots (total): 0
                Off heap memory used (total): 258773554
                SSTable Compression Ratio: 0.33099344385825985
                Number of partitions (estimate): 114982637
                Memtable cell count: 0
                Memtable data size: 0
                Memtable off heap memory used: 0
                Memtable switch count: 0
                Local read count: 5749
                Local read latency: 240.422 ms
                Local write count: 0
                Local write latency: NaN ms
                Pending flushes: 0
                Percent repaired: 0.01
                Bloom filter false positives: 16
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 141861208
                Bloom filter off heap memory used: 141860528
                Index summary off heap memory used: 44391250
                Compression metadata off heap memory used: 72521776
                Compacted partition minimum bytes: 259
                Compacted partition maximum bytes: 4768
                Compacted partition mean bytes: 1366
                Average live cells per slice (last five minutes): 1.0
                Maximum live cells per slice (last five minutes): 1
                Average tombstones per slice (last five minutes): 1.0
                Maximum tombstones per slice (last five minutes): 1
                Dropped Mutations: 0
Things to check? Things to try?
Thanks!
-Joe