[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830659#comment-17830659 ]

Maxim Muzafarov commented on CASSANDRA-19332:
---------------------------------------------

My +1 here as well.

> Dropwizard Meter causes timeouts when infrequently used
> -------------------------------------------------------
>
> Key: CASSANDRA-19332
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19332
> Project: Cassandra
> Issue Type: Bug
> Components: Observability/Metrics
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0, 5.1
>
> Attachments: ci_summary_4.1.html, ci_summary_5.0.html, ci_summary_trunk.html, result_details_4.1.tar.gz, result_details_5.0.tar.gz, result_details_trunk.tar.gz
>
> Observed instances of timeouts on clusters with long uptime and infrequently used tables, and possibly just infrequently used request paths, such as not using CAS for large fractions of a year.
> CAS seems to be more severely impacted because it has more metrics in the request path, such as latency measurements for prepare, propose, and the read from the underlying table.
> Tracing showed ~600-800 milliseconds for these operations between the "appending to memtable" and "sending a response" events. Reads had a delay between finishing the construction of the iterator and sending the read response.
> Stack traces dumped every 100 milliseconds using {{sjk}} show that in prepare and propose a lot of time was being spent in {{Meter.tickIfNecessary}}.
> {code:java}
> Thread [2537] RUNNABLE at 2024-01-25T21:14:48.218 - MutationStage-2
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:71)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.prepare(PaxosState.java:92)
>
> Thread [2539] RUNNABLE at 2024-01-25T21:14:48.520 - MutationStage-4
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:72)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.propose(PaxosState.java:127)
> {code}
> {{tickIfNecessary}} does an amount of work linear in the time since the metric was last updated, read, or created, and this can take a measurable amount of time even in a tight loop. On my M2 MBP it was 1.5 milliseconds for a day of backlog; ~200 days took ~74 milliseconds (140 milliseconds before warm-up).
> A quick fix is to schedule a task to read all the meters once a day, so the ticking isn't done in the request path and there is a more incremental amount of work to process at a time.
> Also observed that {{tickIfNecessary}} is not 100% thread safe: if the loop takes longer than 5 seconds to run, multiple threads can end up attempting to run it at once and then concurrently run {{EWMA.tick}}, which probably results in some ticks not being performed.
> This issue is still present in the latest version of {{Metrics}} when using {{EWMA}}, but {{SlidingWindowTimeAverages}} looks like it has a bounded amount of work required to tick. Switching would change how our metrics work, since the two don't have the same behavior.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
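The linear catch-up behavior described in the issue is easy to reproduce in isolation. Below is a simplified, self-contained model of the tick loop; it mirrors the shape of Dropwizard's {{Meter.tickIfNecessary}} but is not the library's actual code, and the {{TickModel}} name is illustrative only:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of the Meter catch-up logic: the meter ticks once per
// elapsed 5-second interval since it was last updated or read, so the work
// performed is linear in the idle time.
class TickModel {
    static final long TICK_INTERVAL = TimeUnit.SECONDS.toNanos(5);

    private final AtomicLong lastTick;
    long ticksPerformed; // counts the per-interval work, for illustration

    TickModel(long startNanos) {
        this.lastTick = new AtomicLong(startNanos);
    }

    void tickIfNecessary(long nowNanos) {
        long oldTick = lastTick.get();
        long age = nowNanos - oldTick;
        if (age > TICK_INTERVAL) {
            // Align the new tick timestamp to an interval boundary.
            long newIntervalStart = nowNanos - age % TICK_INTERVAL;
            if (lastTick.compareAndSet(oldTick, newIntervalStart)) {
                long requiredTicks = age / TICK_INTERVAL;
                for (long i = 0; i < requiredTicks; i++) {
                    ticksPerformed++; // the real code runs EWMA.tick() here
                }
            }
        }
    }
}
```

At a 5-second interval, a meter idle for one day has to perform 86,400 / 5 = 17,280 ticks before the pending update is recorded, and ~200 idle days is roughly 3.5 million ticks, consistent with the tens of milliseconds measured above. The once-a-day read proposed as a quick fix caps each catch-up at about 17,280 ticks. Note also that the compare-and-set only guards the timestamp swap, not the loop body, which is the thread-safety gap the issue describes.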
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830654#comment-17830654 ]

Stefan Miklosovic commented on CASSANDRA-19332:
-----------------------------------------------

[~mmuzaf] would you please explicitly +1 this? I just stumbled upon this ... you have indeed better visibility into it. Thank you very much.
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830652#comment-17830652 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

I think I figured out what happened. The latest test repo version was tested, but my branch was behind trunk, so it didn't have the change to nodetool.
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830649#comment-17830649 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

So which build are you referring to? When I look at the trunk build that generated that error, the log has

{code}
Cloned https://github.com/apache/cassandra-dtest.git trunk to /workspace/context/cassandra-dtest; commit is 7a82b3757c136f79b52a76fdf3e98891dfff6b41
{code}

and that SHA comes after f3ca59c [https://github.com/apache/cassandra-dtest/commits/trunk/]

I'll look a little deeper and try to figure out how the executors ended up running with the wrong commit on the dtest repo, because clearly it happened.
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830611#comment-17830611 ]

Stefan Miklosovic commented on CASSANDRA-19332:
-----------------------------------------------

j17 precommit tests (minus upgrade): https://app.circleci.com/pipelines/github/instaclustr/cassandra/4080/workflows/677745ba-4520-4e1d-951f-1829004d3235

From your test results for trunk I read:

{code}
ccmlib.node.ToolError: Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', 'cms', 'initialize'] exited with non-zero status; exit status: 1;
stdout: nodetool: Found unexpected parameters: [cms, initialize]
See 'nodetool help' or 'nodetool help '.
stderr: Picked up JAVA_TOOL_OPTIONS: -XX:ErrorFile=/parallel-ci/output/hs_err_pid_%p.log
{code}

This is caused by you running old dtests; you probably haven't pulled the cassandra-dtest repo up to https://github.com/apache/cassandra-dtest/commit/f3ca59c76dc946e75b3b9dbc90f22dc51bf1c73a

As I read it from the logs you uploaded, all dtest failures for trunk are about this.
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830606#comment-17830606 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

Sounds good, I will wait for that.
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830602#comment-17830602 ]

Stefan Miklosovic commented on CASSANDRA-19332:
-----------------------------------------------

Yeah, that makes sense. I can run a CI for trunk in Circle for you to see if your CI was just flaky or what ... give me some time here please.
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830574#comment-17830574 ] Ariel Weisberg commented on CASSANDRA-19332: Yeah the Dropwizard fix makes this not necessary by fixing the cost of marking a meter that hasn't been used in a long time. We can remove this once that is fixed. Since we don't update dependencies in older versions so we will need this in 4.0/4.1, and 5.0 if it releases before Dropwizard does. > Dropwizard Meter causes timeouts when infrequently used > --- > > Key: CASSANDRA-19332 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19332 > Project: Cassandra > Issue Type: Bug > Components: Observability/Metrics >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg >Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0, 5.1 > > Attachments: ci_summary_4.1.html, ci_summary_5.0.html, > ci_summary_trunk.html, result_details_4.1.tar.gz, result_details_5.0.tar.gz, > result_details_trunk.tar.gz > > > Observed instances of timeouts on clusters with long uptime and infrequently > used tables and possibly just request paths such as not using CAS for large > fractions of a year. > CAS seems to be more severely impacted because it has more metrics in the > request path such as latency measurements for prepare, propose, and the read > from the underlying table. > Tracing showed ~600-800 milliseconds for these operations in between the > “appending to memtable” and “sending a response” events. Reads had a delay > between finishing the construction of the iterator and sending the read > response. > Stack traces dumped every 100 milliseconds using {{sjk}} shows that in > prepare and propose a lot of time was being spent in > {{{}Meter.tickIfNecessary{}}}. 
> {code:java} > Thread [2537] RUNNABLE at 2024-01-25T21:14:48.218 - MutationStage-2 > com.codahale.metrics.Meter.tickIfNecessary(Meter.java:71) > com.codahale.metrics.Meter.mark(Meter.java:55) > com.codahale.metrics.Meter.mark(Meter.java:46) > com.codahale.metrics.Timer.update(Timer.java:150) > com.codahale.metrics.Timer.update(Timer.java:86) > org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159) > org.apache.cassandra.service.paxos.PaxosState.prepare(PaxosState.java:92) > Thread [2539] RUNNABLE at 2024-01-25T21:14:48.520 - MutationStage-4 > com.codahale.metrics.Meter.tickIfNecessary(Meter.java:72) > com.codahale.metrics.Meter.mark(Meter.java:55) > com.codahale.metrics.Meter.mark(Meter.java:46) > com.codahale.metrics.Timer.update(Timer.java:150) > com.codahale.metrics.Timer.update(Timer.java:86) > org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159) > org.apache.cassandra.service.paxos.PaxosState.propose(PaxosState.java:127){code} > {{tickIfNecessary}} does a linear amount of work proportional to the time > since the last time the metric was updated/read/created and this can actually > take a measurable amount of time even in a tight loop. On my M2 MBP it was > 1.5 milliseconds for a day, ~200 days took ~74 milliseconds. Before it warmed > up it was 140 milliseconds. > A quick fix is to schedule a task to read all the meters once a day so it > isn’t done in the request path and we have a more incremental amount to > process at a time. > Also observed that {{tickIfNecessary}} is not 100% thread safe in that if it > takes longer than 5 seconds to run the loop it can end up with multiple > threads attempting to run the loop at once and then they will concurrently > run {{EWMA.tick}} which probably results in some ticks not being performed. 
> This issue is still present in the latest version of {{Metrics}} if using {{EWMA}}, but {{SlidingWindowTimeAverages}} looks like it has a bounded amount of work required to tick. Switching would change how our metrics work since the two don't have the same behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
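The linear catch-up cost can be modeled with a toy meter: each untouched 5-second interval requires one EWMA tick, so a meter idle for ~200 days pays millions of ticks inside a single request, while a daily background read bounds any one call to a day's worth. Everything below (`ToyMeter`, the instrumentation counter) is a simplified stand-in for Dropwizard's `Meter.tickIfNecessary`, not its actual implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of the catch-up behavior described in the ticket:
// tickIfNecessary performs one tick per elapsed 5-second interval since the
// last tick, so an idle meter pays O(idle time) on its next mark()/read.
class ToyMeter {
    static final long TICK_INTERVAL_NANOS = 5_000_000_000L;
    private final AtomicLong lastTick;
    long totalTicks = 0; // instrumentation: cumulative catch-up work

    ToyMeter(long nowNanos) { lastTick = new AtomicLong(nowNanos); }

    // Returns how many EWMA-style ticks this call had to perform.
    long tickIfNecessary(long nowNanos) {
        long old = lastTick.get();
        long age = nowNanos - old;
        if (age < TICK_INTERVAL_NANOS) return 0;
        long newTick = nowNanos - age % TICK_INTERVAL_NANOS;
        if (!lastTick.compareAndSet(old, newTick)) return 0; // another thread won
        long required = age / TICK_INTERVAL_NANOS;
        totalTicks += required; // stand-in for the per-interval EWMA.tick() loop
        return required;
    }
}

public class MeterTouchDemo {
    public static void main(String[] args) {
        long day = 86_400_000_000_000L; // one day in nanoseconds
        // Idle for 200 days, then a request arrives: all the ticks happen at once.
        ToyMeter cold = new ToyMeter(0);
        long coldWork = cold.tickIfNecessary(200 * day);
        // With a daily background "touch", no single call pays more than one day.
        ToyMeter touched = new ToyMeter(0);
        long maxPerCall = 0;
        for (int d = 1; d <= 200; d++)
            maxPerCall = Math.max(maxPerCall, touched.tickIfNecessary(d * day));
        System.out.println(coldWork + " vs " + maxPerCall);
    }
}
```

The same total number of ticks is performed either way; the daily touch only changes where the work happens, which is why the ticket calls it a quick fix rather than a cure.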
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830554#comment-17830554 ] Stefan Miklosovic commented on CASSANDRA-19332: --- Too bad the PR for Dropwizard has not been merged and released yet. Do I understand correctly that we do not need Dropwizard's PR merged because we work around it here, and that this change will become unnecessary once the PR is merged and we upgrade the library?
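The workaround discussed above amounts to scheduling a background task that periodically reads every meter, so `tickIfNecessary`'s catch-up loop runs off the request path. Below is a hedged sketch of that shape; the `Meter` interface and registry are hypothetical stand-ins for `com.codahale.metrics.Meter` and Cassandra's metric registry, not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the "touch all meters periodically" workaround. Reading a rate
// forces the meter to tick, so the catch-up cost is paid here, on a
// background thread, instead of inside prepare/propose/read latency updates.
public class MeterTicker {
    public interface Meter { double getOneMinuteRate(); } // reading forces a tick

    private final List<Meter> meters = new ArrayList<>();

    public synchronized void register(Meter m) { meters.add(m); }

    // One pass over all registered meters; result discarded, the tick is the point.
    public synchronized void touchAll() {
        for (Meter m : meters)
            m.getOneMinuteRate();
    }

    /** Starts a daemon thread that touches all meters once per period. */
    public ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "meter-ticker");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleAtFixedRate(this::touchAll, period, period, unit);
        return ses;
    }
}
```

A once-a-day period matches the ticket's suggestion; anything well under the idle spans observed in production (months) keeps the per-call catch-up small.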
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830549#comment-17830549 ] Ariel Weisberg commented on CASSANDRA-19332: I haven't merged this for a while because I have had a steady stream of CI issues. Based on comparisons with nightlies, the test failures don't look related. 5.0 is basically clean; 4.0/4.1 have a bunch of upgrade test issues, and so does trunk. I think this is more specific to our CI environment and some issues invoking different upgrade paths. I'm going to go ahead and merge anyway unless someone objects, since the scope of this change is narrow and doesn't look related to any of the failures.
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812776#comment-17812776 ] Ariel Weisberg commented on CASSANDRA-19332: PR to fix the issue in Dropwizard: [https://github.com/dropwizard/metrics/pull/3929]
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811415#comment-17811415 ] Maxim Muzafarov commented on CASSANDRA-19332: - Should we also open an issue against Dropwizard Metrics? https://github.com/dropwizard/metrics/issues
[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811399#comment-17811399 ] Stefan Miklosovic commented on CASSANDRA-19332: --- cc [~mmuzaf]