[jira] [Comment Edited] (CASSANDRA-19651) idealCLWriteLatency metric reports the worst response time instead of the time when ideal CL is satisfied

Dmitry Konstantinov (Jira) Thu, 22 Aug 2024 16:57:09 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876137#comment-17876137
 ]


Dmitry Konstantinov edited comment on CASSANDRA-19651 at 8/22/24 11:56 PM:
---------------------------------------------------------------------------

If I understand correctly, org.apache.cassandra.gms.Gossiper#waitToSettle just 
waits for 5 + 3 x 1 = 8 seconds if there are no changes in the number of nodes 
discovered using gossip (even if we have not had any interactions with other 
nodes using gossip at all).
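
To make the arithmetic concrete, here is a simplified model of that wait loop. It is only a sketch and not the actual Gossiper code: the class name, method name and constants are hypothetical, with the values chosen to match the 5 + 3 x 1 figures above, and time is simulated instead of slept.

```java
import java.util.List;

final class GossipSettleModel
{
    // Assumed values, taken from the "5 + 3 x 1 = 8 seconds" arithmetic above.
    static final long MIN_WAIT_MS = 5000;
    static final long POLL_INTERVAL_MS = 1000;
    static final int POLL_SUCCESSES_REQUIRED = 3;

    /**
     * Returns the simulated time (ms) until gossip is treated as settled.
     * observedCounts.get(i) is the endpoint count seen on the i-th poll;
     * once the list is exhausted, the count is assumed to stay unchanged.
     */
    static long settleTimeMs(int initialCount, List<Integer> observedCounts)
    {
        long elapsed = MIN_WAIT_MS; // unconditional initial wait
        int previous = initialCount;
        int stablePolls = 0;
        int poll = 0;
        while (stablePolls < POLL_SUCCESSES_REQUIRED)
        {
            elapsed += POLL_INTERVAL_MS;
            int current = poll < observedCounts.size() ? observedCounts.get(poll) : previous;
            poll++;
            if (current == previous)
            {
                stablePolls++;
            }
            else
            {
                // the count changed, so start counting stable polls again
                stablePolls = 0;
                previous = current;
            }
        }
        return elapsed;
    }

    public static void main(String[] args)
    {
        // No gossip activity at all: the count never changes.
        System.out.println(settleTimeMs(1, List.of()));     // prints 8000
        // A second node shows up on the second poll.
        System.out.println(settleTimeMs(1, List.of(1, 2))); // prints 10000
    }
}
```

In this model an unchanged count is indistinguishable from "no gossip round has run yet", which is the situation described above.
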
I have added a 5-second sleep to 
org.apache.cassandra.gms.Gossiper.GossipTask#run (there is also a 1-second 
initial delay when GossipTask is scheduled):
{code:java}
    private class GossipTask implements Runnable
    {
        public void run()
        {
            try
            {
                //wait on messaging service to start listening
                MessagingService.instance().waitUntilListening();
                Thread.sleep(5000); // <===============================

                taskLock.lock();
{code}
With that sleep in place, the NPE reproduces quite frequently in 
org.apache.cassandra.distributed.test.ring.BootstrapTest#bootstrapUnspecifiedResumeTest:
{code:java}
java.lang.NullPointerException: Cannot invoke "org.apache.cassandra.gms.EndpointState.getApplicationState(org.apache.cassandra.gms.ApplicationState)" because "state" is null
        at org.apache.cassandra.distributed.action.GossipHelper$PullSchemaFrom.lambda$accept$6adea493$1(GossipHelper.java:245)
        at org.apache.cassandra.distributed.impl.IsolatedExecutor.lambda$async$10(IsolatedExecutor.java:156)
        at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:840)
{code}
So it is possible that the NPE in the original test run was caused by a delay 
in GossipTask execution.


> idealCLWriteLatency metric reports the worst response time instead of the 
> time when ideal CL is satisfied
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Observability
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 4.0.14, 5.0.1, 5.1, 4.1.7
>
>         Attachments: 
> ci_summary-cassandra-4.0-a75f6c3e81f677e50c0a0d467dd5dad672f923e3.html, 
> ci_summary-cassandra-4.1-1ed312f881c0c170c8833ff9fbf397ab8fc625cc.html, 
> ci_summary-cassandra-5.0-009f2982ac88d9c9bc0a7a7d29220f055aa7f11e.html, 
> ci_summary-trunk-da68729322515b4a7a698b73a0154ecdeb3abf39.html, 
> result_details-cassandra-4.0-a75f6c3e81f677e50c0a0d467dd5dad672f923e3.tar.gz, 
> result_details-cassandra-5.0-009f2982ac88d9c9bc0a7a7d29220f055aa7f11e.tar.gz, 
> result_details-trunk-da68729322515b4a7a698b73a0154ecdeb3abf39.tar.gz, 
> select-junit-tests-rerun-4.1.zip
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> org.apache.cassandra.service.AbstractWriteResponseHandler:
> {code:java}
> private final void decrementResponseOrExpired()
> {
>     int decrementedValue = responsesAndExpirations.decrementAndGet();
>     if (decrementedValue == 0)
>     {
>         // The condition being signaled is a valid proxy for the CL being achieved
>         // Only mark it as failed if the requested CL was achieved.
>         if (!condition.isSignalled() && requestedCLAchieved)
>         {
>             replicaPlan.keyspace().metric.writeFailedIdealCL.inc();
>         }
>         else
>         {
>             replicaPlan.keyspace().metric.idealCLWriteLatency.addNano(nanoTime() - queryStartNanoTime);
>         }
>     }
> }
> {code}
> Actual result: responsesAndExpirations is the total number of replicas across 
> all DCs, which does not depend on the ideal CL, so 
> replicaPlan.keyspace().metric.idealCLWriteLatency is updated only when the 
> last response/timeout from all replicas arrives.
> Expected result: replicaPlan.keyspace().metric.idealCLWriteLatency is updated 
> as soon as enough replicas have responded to satisfy the ideal CL.
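
The difference between the actual and expected behaviour can be illustrated with a small standalone sketch. It is not the committed fix and does not use Cassandra classes: IdealClLatencySketch, idealClBlockFor and the explicit timestamps are hypothetical names introduced only for illustration. The idea is to count down the number of successful responses the ideal CL needs, and to record the latency at the moment that counter reaches zero instead of when the last replica answers.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class IdealClLatencySketch
{
    private final long startNanos;
    // Successful responses still needed to satisfy the ideal CL (hypothetical field).
    private final AtomicInteger remainingForIdealCl;
    // -1 means the ideal CL has not been satisfied yet.
    private final AtomicLong idealClLatencyNanos = new AtomicLong(-1);

    IdealClLatencySketch(long startNanos, int idealClBlockFor)
    {
        this.startNanos = startNanos;
        this.remainingForIdealCl = new AtomicInteger(idealClBlockFor);
    }

    /** Called once per successful replica response; nowNanos is passed in to keep the sketch deterministic. */
    void onSuccessfulResponse(long nowNanos)
    {
        // decrementAndGet() returns 0 exactly once, so the latency is recorded a single time.
        if (remainingForIdealCl.decrementAndGet() == 0)
            idealClLatencyNanos.set(nowNanos - startNanos);
    }

    long idealClLatencyNanos()
    {
        return idealClLatencyNanos.get();
    }

    public static void main(String[] args)
    {
        // Three replicas in total, ideal CL needs two of them.
        IdealClLatencySketch tracker = new IdealClLatencySketch(0, 2);
        tracker.onSuccessfulResponse(10);
        tracker.onSuccessfulResponse(20);
        tracker.onSuccessfulResponse(90); // slowest replica, arrives after the ideal CL is met
        System.out.println(tracker.idealClLatencyNanos()); // prints 20
    }
}
```

With the behaviour described under "Actual result", the same three responses would produce a recorded latency of 90, the time of the slowest replica.
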


