[ 
https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830611#comment-17830611
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19332 at 3/25/24 5:23 PM:
------------------------------------------------------------------------

j17 precommit tests (minus upgrade) 
[https://app.circleci.com/pipelines/github/instaclustr/cassandra/4080/workflows/677745ba-4520-4e1d-951f-1829004d3235]

By the way, I just realized that I will not be able to run dtests because my 
CircleCI plan times out all jobs after 1 hour, and I do not have enough 
parallelism to fit them into that limit either.

 

From your test results for trunk, I read:
{code:java}
ccmlib.node.ToolError: Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', 
'cms', 'initialize'] exited with non-zero status; exit status: 1; stdout: 
nodetool: Found unexpected parameters: [cms, initialize] See 'nodetool help' or 
'nodetool help <command>'. ; stderr: Picked up JAVA_TOOL_OPTIONS: 
-XX:ErrorFile=/parallel-ci/output/hs_err_pid_%p.log
{code}
This is caused by running old dtests; you probably haven't pulled the latest 
cassandra-dtest repo:

[https://github.com/apache/cassandra-dtest/commit/f3ca59c76dc946e75b3b9dbc90f22dc51bf1c73a]

As I read the logs you uploaded, all dtest failures for trunk are about this.



> Dropwizard Meter causes timeouts when infrequently used
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19332
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19332
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0, 5.1
>
>         Attachments: ci_summary_4.1.html, ci_summary_5.0.html, 
> ci_summary_trunk.html, result_details_4.1.tar.gz, result_details_5.0.tar.gz, 
> result_details_trunk.tar.gz
>
>
> Observed instances of timeouts on clusters with long uptime, infrequently 
> used tables, and possibly just infrequently exercised request paths, such as 
> not using CAS for large fractions of a year.
> CAS seems to be more severely impacted because it has more metrics in the 
> request path such as latency measurements for prepare, propose, and the read 
> from the underlying table.
> Tracing showed ~600-800 milliseconds for these operations in between the 
> “appending to memtable” and “sending a response” events. Reads had a delay 
> between finishing the construction of the iterator and sending the read 
> response.
> Stack traces dumped every 100 milliseconds using {{sjk}} show that in 
> prepare and propose a lot of time was being spent in 
> {{Meter.tickIfNecessary}}.
> {code:java}
> Thread [2537] RUNNABLE at 2024-01-25T21:14:48.218 - MutationStage-2
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:71)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.prepare(PaxosState.java:92)
> Thread [2539] RUNNABLE at 2024-01-25T21:14:48.520 - MutationStage-4
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:72)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.propose(PaxosState.java:127){code}
> {{tickIfNecessary}} does an amount of work linear in the time since the 
> metric was last updated, read, or created, and this can take a measurable 
> amount of time even in a tight loop. On my M2 MBP it was 1.5 milliseconds 
> for a day's worth of ticks and ~74 milliseconds for ~200 days; before 
> warm-up it was 140 milliseconds.
> A quick fix is to schedule a task to read all the meters once a day so the 
> catch-up isn't done in the request path and there is a more incremental 
> amount of work to process at a time.
> Also observed that {{tickIfNecessary}} is not fully thread safe: if the loop 
> takes longer than 5 seconds to run, multiple threads can end up running it 
> at once and will concurrently call {{EWMA.tick}}, which probably results in 
> some ticks not being performed.
> This issue is still present in the latest version of {{Metrics}} when using 
> {{EWMA}}, but {{SlidingTimeWindowMovingAverages}} looks like it has a 
> bounded amount of work required to tick. Switching would change how our 
> metrics work since the two don't have the same behavior.
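
For context on the "linear amount of work" point in the quoted description: the 
ticking logic in {{com.codahale.metrics.Meter}} looks roughly like the sketch 
below (paraphrased from the Dropwizard Metrics source, field names approximate, 
not copied verbatim). The catch-up loop runs once per missed 5-second interval, 
so a meter that has not been touched for ~200 days has to run it a few million 
times, and if that loop takes longer than 5 seconds the CAS no longer keeps a 
second thread from ticking the EWMAs concurrently, which is the race described 
above.
{code:java}
// Approximate shape of Meter.tickIfNecessary() in Dropwizard Metrics
// (paraphrased for illustration, not the exact upstream source).
private static final long TICK_INTERVAL = TimeUnit.SECONDS.toNanos(5);

private void tickIfNecessary()
{
    long oldTick = lastTick.get();
    long newTick = clock.getTick();
    long age = newTick - oldTick;
    if (age > TICK_INTERVAL)
    {
        long newIntervalStartTick = newTick - age % TICK_INTERVAL;
        if (lastTick.compareAndSet(oldTick, newIntervalStartTick))
        {
            // One iteration per missed 5-second interval: ~17k per idle day,
            // a few million after ~200 days of inactivity.
            long requiredTicks = age / TICK_INTERVAL;
            for (long i = 0; i < requiredTicks; i++)
            {
                m1Rate.tick();
                m5Rate.tick();
                m15Rate.tick();
            }
        }
    }
}
{code}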
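And a minimal sketch of the "read all the meters once a day" quick fix from the 
description, assuming a plain Dropwizard {{MetricRegistry}}; the class name and 
schedule are illustrative, not the actual patch on this ticket. Reading any rate 
(e.g. {{getOneMinuteRate()}}) goes through {{tickIfNecessary()}}, so the 
catch-up work happens in this background task instead of on the request path.
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Metered;
import com.codahale.metrics.MetricRegistry;

// Illustrative only: periodically touch every Metered metric so that
// Meter.tickIfNecessary() never has to catch up on months of missed ticks
// inside a request thread.
public final class MeterTicker
{
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    public void start(MetricRegistry registry)
    {
        executor.scheduleAtFixedRate(() ->
            registry.getMetrics().values().stream()
                    .filter(m -> m instanceof Metered)          // Meter, Timer, ...
                    .forEach(m -> ((Metered) m).getOneMinuteRate()),
            1, 1, TimeUnit.DAYS);
    }
}
{code}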


