One of the nodes was swapping in this case; fixed that - problem solved.
Yes - the machines are varying sizes and I wanted to test to see how
well a cluster would work in such a configuration.
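For anyone who hits the same symptoms: the usual recommendation for Cassandra hosts is to rule swap out entirely (run `swapoff -a` for the running system and remove swap entries from /etc/fstab). A sketch of the persistent sysctl side - the file name and value here are common defaults, not something specific to my cluster:

```
# /etc/sysctl.d/99-cassandra.conf (hypothetical file name)
# Keep the kernel from swapping out the Cassandra JVM; long swap-in
# stalls show up as GC-like pauses and failure-detector warnings.
vm.swappiness = 0
```

Apply with `sysctl --system` (or reboot) after editing.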
-Joe
On 5/10/2021 8:14 PM, Kane Wilson wrote:
It seems like some of your nodes are overloaded, potentially at least
#RF of them. Is it intentional that some of your nodes have varying
numbers of tokens? If nodes are heavily overloaded, GC tuning generally
won't help much; you're best off starting by reducing load or
increasing capacity.
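If you want a quick way to spot the token imbalance across a cluster, comparing per-node token counts against the cluster's most common value makes it obvious. A rough sketch - the field positions assume the standard plain-text `nodetool status` layout, and `uneven_token_nodes` is just an illustrative helper name:

```python
# Sketch: flag nodes whose token count differs from the cluster's most
# common value, given the plain-text output of `nodetool status`.
from collections import Counter

def uneven_token_nodes(status_output: str):
    nodes = {}
    for line in status_output.splitlines():
        parts = line.split()
        # Data rows start with a status/state code like "UN" ("Up/Normal").
        # Columns: status, address, load ("9.16 GiB" = 2 fields), tokens, ...
        if len(parts) >= 7 and parts[0] in ("UN", "DN", "UL", "DL", "UJ", "UM"):
            address, tokens = parts[1], int(parts[4])
            nodes[address] = tokens
    if not nodes:
        return []
    common = Counter(nodes.values()).most_common(1)[0][0]
    return sorted(addr for addr, t in nodes.items() if t != common)
```

Feed it the raw `nodetool status` text and it returns the addresses carrying an unusual token count - in a cluster like the one below, the 30-, 4-, and 120-token nodes.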
raft.so <https://raft.so> - Cassandra consulting, support, and
managed services
On Tue, May 11, 2021 at 7:44 AM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Hi all - I'm getting the following error on RC1:
WARN [Messaging-EventLoop-3-23] 2021-05-10 17:29:12,431 NoSpamLogger.java:95 - /172.16.100.39:7000->/172.16.100.248:7000-URGENT_MESSAGES-e8d21588 dropping message of type FAILURE_RSP whose timeout expired before reaching the network
ERROR [CounterMutationStage-62] 2021-05-10 17:29:12,431 AbstractLocalAwareExecutorService.java:166 - Uncaught exception on thread Thread[CounterMutationStage-62,5,main]
java.lang.RuntimeException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2278)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.db.CounterMutation.grabCounterLocks(CounterMutation.java:162)
        at org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:131)
        at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:1678)
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2274)
        ... 6 common frames omitted
This happens under load.
I'm also seeing a lot of these messages:
WARN [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:319 - Not marking nodes down due to local pause of 5785753812ns > 5000000000ns
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
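Those pause figures are in nanoseconds, so the warning is saying the process stalled for nearly 5.8 s against the detector's 5 s threshold - long enough to look like heavy GC or swapping:

```python
# Convert the failure detector's nanosecond figures to seconds.
pause_ns = 5_785_753_812      # the local pause from the WARN line
threshold_ns = 5_000_000_000  # the detector's threshold from the same line

print(pause_ns / 1e9)      # observed pause in seconds (~5.79)
print(threshold_ns / 1e9)  # threshold in seconds (5.0)
```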
The other messages are slow queries like:
SELECT mediatype, origvalue FROM doc.origdoc WHERE uuid = DS_5_2021-05-08T06-53-41.442Z_Hi0ywdNE LIMIT 1>, time 1370 msec - slow timeout 500 msec
I've tried switching to the G1 garbage collector (Java 11), and that did
reduce these times (I was seeing over 5000 msec). The above SELECT
statement is on a table where uuid is the primary key.
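For reference, by "switching to G1" I mean uncommenting the G1 section in conf/jvm11-server.options and disabling the CMS settings. A sketch - these flag names match what ships with the 4.0 options file, but the pause target is something to tune rather than a recommendation:

```
# conf/jvm11-server.options - G1 section (ships commented out)
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
# ...and comment out the CMS flags in the same file.
```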
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.208  9.16 GiB   30      9.3%              2529b6ed-cdb2-43c2-bdd7-171cfe308bd3  rack1
UN  172.16.100.249  60.69 GiB  200     62.9%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   61.16 GiB  200     62.9%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   61.07 GiB  200     63.0%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  1.24 GiB   4       1.3%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  60.35 GiB  200     62.9%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.37   37.18 GiB  120     37.7%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
nodetool tablestats doc.origdoc
Total number of tables: 74
----------------
Keyspace : doc
        Read Count: 37511
        Read Latency: 33.929465116899046 ms
        Write Count: 4604965
        Write Latency: 0.20405303102195133 ms
        Pending Flushes: 0
                Table: origdoc
                SSTable count: 85
                Old SSTable count: 0
                Space used (live): 54635707180
                Space used (total): 54635707180
                Space used by snapshots (total): 0
                Off heap memory used (total): 258773554
                SSTable Compression Ratio: 0.33099344385825985
                Number of partitions (estimate): 114982637
                Memtable cell count: 0
                Memtable data size: 0
                Memtable off heap memory used: 0
                Memtable switch count: 0
                Local read count: 5749
                Local read latency: 240.422 ms
                Local write count: 0
                Local write latency: NaN ms
                Pending flushes: 0
                Percent repaired: 0.01
                Bloom filter false positives: 16
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 141861208
                Bloom filter off heap memory used: 141860528
                Index summary off heap memory used: 44391250
                Compression metadata off heap memory used: 72521776
                Compacted partition minimum bytes: 259
                Compacted partition maximum bytes: 4768
                Compacted partition mean bytes: 1366
                Average live cells per slice (last five minutes): 1.0
                Maximum live cells per slice (last five minutes): 1
                Average tombstones per slice (last five minutes): 1.0
                Maximum tombstones per slice (last five minutes): 1
                Dropped Mutations: 0
Things to check? Things to try?
Thanks!
-Joe