[ 
https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830652#comment-17830652
 ] 

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

I think I figured out what happened. The latest test repo version was tested, 
but my branch was behind trunk so it didn't have the change to nodetool.

> Dropwizard Meter causes timeouts when infrequently used
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19332
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19332
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0, 5.1
>
>         Attachments: ci_summary_4.1.html, ci_summary_5.0.html, 
> ci_summary_trunk.html, result_details_4.1.tar.gz, result_details_5.0.tar.gz, 
> result_details_trunk.tar.gz
>
>
> Observed instances of timeouts on clusters with long uptime and infrequently 
> used tables and possibly just request paths such as not using CAS for large 
> fractions of a year.
> CAS seems to be more severely impacted because it has more metrics in the 
> request path such as latency measurements for prepare, propose, and the read 
> from the underlying table.
> Tracing showed ~600-800 milliseconds for these operations in between the 
> “appending to memtable” and “sending a response” events. Reads had a delay 
> between finishing the construction of the iterator and sending the read 
> response.
> Stack traces dumped every 100 milliseconds using {{sjk}} shows that in 
> prepare and propose a lot of time was being spent in 
> {{{}Meter.tickIfNecessary{}}}.
> {code:java}
> Thread [2537] RUNNABLE at 2024-01-25T21:14:48.218 - MutationStage-2
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:71)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.prepare(PaxosState.java:92)
> Thread [2539] RUNNABLE at 2024-01-25T21:14:48.520 - MutationStage-4
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:72)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.propose(PaxosState.java:127){code}
> {{tickIfNecessary}} does a linear amount of work proportional to the time 
> since the last time the metric was updated/read/created and this can actually 
> take a measurable amount of time even in a tight loop. On my M2 MBP it was 
> 1.5 milliseconds for a day, ~200 days took ~74 milliseconds. Before it warmed 
> up it was 140 milliseconds.
> A quick fix is to schedule a task to read all the meters once a day so it 
> isn’t done in the request path and we have a more incremental amount to 
> process at a time.
> Also observed that {{tickIfNecessary}} is not 100% thread safe in that if it 
> takes longer than 5 seconds to run the loop it can end up with multiple 
> threads attempting to run the loop at once and then they will concurrently 
> run {{EWMA.tick}} which probably results in some ticks not being performed.
> This issue is still present in the latest version of {{Metrics}} if using 
> {{{}EWMA{}}}, but {{SlidingWindowTimeAverages}} looks like it has a bounded 
> amount of work required to tick.  Switching would change how our metrics work 
> since the two don't have the same behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to