[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Maxim Muzafarov (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830659#comment-17830659 ]

Maxim Muzafarov commented on CASSANDRA-19332:
---------------------------------------------

My +1 here as well. 

> Dropwizard Meter causes timeouts when infrequently used
> ---
>
> Key: CASSANDRA-19332
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19332
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0, 5.1
>
> Attachments: ci_summary_4.1.html, ci_summary_5.0.html, 
> ci_summary_trunk.html, result_details_4.1.tar.gz, result_details_5.0.tar.gz, 
> result_details_trunk.tar.gz
>
>
> Observed instances of timeouts on clusters with long uptime and infrequently 
> used tables, and possibly just infrequently used request paths, such as not 
> using CAS for large fractions of a year.
> CAS seems to be more severely impacted because it has more metrics in the 
> request path such as latency measurements for prepare, propose, and the read 
> from the underlying table.
> Tracing showed ~600-800 milliseconds for these operations in between the 
> “appending to memtable” and “sending a response” events. Reads had a delay 
> between finishing the construction of the iterator and sending the read 
> response.
> Stack traces dumped every 100 milliseconds using {{sjk}} show that in 
> prepare and propose a lot of time was being spent in 
> {{Meter.tickIfNecessary}}.
> {code:java}
> Thread [2537] RUNNABLE at 2024-01-25T21:14:48.218 - MutationStage-2
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:71)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.prepare(PaxosState.java:92)
> Thread [2539] RUNNABLE at 2024-01-25T21:14:48.520 - MutationStage-4
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:72)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.propose(PaxosState.java:127){code}
> {{tickIfNecessary}} does an amount of work linear in the time since the 
> metric was last updated/read/created, and this can take a measurable amount 
> of time even in a tight loop. On my M2 MBP it was 1.5 milliseconds for a day 
> of elapsed time, and ~200 days took ~74 milliseconds (140 milliseconds before 
> it warmed up).
> A quick fix is to schedule a task to read all the meters once a day so the 
> ticking isn't done in the request path and there is a more incremental amount 
> of work to process at a time.
> Also observed that {{tickIfNecessary}} is not 100% thread safe: if the loop 
> takes longer than 5 seconds to run, multiple threads can end up attempting to 
> run it at once and will then concurrently run {{EWMA.tick}}, which probably 
> results in some ticks not being performed.
> This issue is still present in the latest version of {{Metrics}} if using 
> {{EWMA}}, but {{SlidingTimeWindowMovingAverages}} looks like it has a bounded 
> amount of work required to tick. Switching would change how our metrics work 
> since the two don't have the same behavior.
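
For readers unfamiliar with the Dropwizard internals, the following is a simplified, hypothetical sketch of the tick behaviour described above (illustrative names, not the actual library source): elapsed time is divided into 5-second tick intervals and the EWMA decay is applied once per missed interval, so a meter that has sat idle for months has millions of intervals to catch up on the next time it is marked or read.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch (not the Dropwizard source): the catch-up work is linear in
// the number of 5-second intervals since the last tick.
final class SketchMeter
{
    private static final long TICK_INTERVAL = TimeUnit.SECONDS.toNanos(5);

    private final AtomicLong lastTick = new AtomicLong(System.nanoTime());

    void tickIfNecessary(long nowNanos)
    {
        long oldTick = lastTick.get();
        long age = nowNanos - oldTick;
        if (age > TICK_INTERVAL)
        {
            long newIntervalStartTick = nowNanos - age % TICK_INTERVAL;
            // Only one thread wins the CAS, but if the loop below takes longer
            // than 5 seconds another thread can enter it too -- the race noted
            // in the description.
            if (lastTick.compareAndSet(oldTick, newIntervalStartTick))
            {
                long requiredTicks = age / TICK_INTERVAL;
                // ~200 days of idle time is ~3.5 million iterations, hence the
                // tens of milliseconds observed in the request path.
                for (long i = 0; i < requiredTicks; i++)
                    decayRatesOnce(); // the EWMA.tick() equivalent
            }
        }
    }

    private void decayRatesOnce()
    {
        // apply one 5-second interval of exponential decay to the rates
    }
}
{code}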
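
The "quick fix" mentioned in the description could look roughly like the sketch below: a daemon task that reads every {{Meter}} and {{Timer}} in the registry once a day, so the catch-up ticking happens on a background thread rather than in the read/write path. This is an illustration of the idea under assumed wiring (registry reference, thread factory, one-day period), not the committed patch.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

// Illustrative sketch of the workaround: reading a rate forces tickIfNecessary,
// keeping the backlog of missed 5-second ticks small for rarely used metrics.
final class MeterTicker
{
    static ScheduledExecutorService start(MetricRegistry registry)
    {
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "meter-ticker");
                    thread.setDaemon(true);
                    return thread;
                });
        executor.scheduleAtFixedRate(() -> {
            for (Meter meter : registry.getMeters().values())
                meter.getOneMinuteRate();
            for (Timer timer : registry.getTimers().values())
                timer.getOneMinuteRate();
        }, 1, 1, TimeUnit.DAYS);
        return executor;
    }
}
{code}

With a one-day period the worst-case backlog is one day of 5-second intervals (about 17k), which the description measured at roughly 1.5 milliseconds.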



[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Stefan Miklosovic (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830654#comment-17830654 ]

Stefan Miklosovic commented on CASSANDRA-19332:
-----------------------------------------------

[~mmuzaf] would you please explicitly +1 this? I just stumbled upon it ... you 
have better visibility into this than I do. Thank you very much.

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Ariel Weisberg (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830652#comment-17830652 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

I think I figured out what happened. The latest test repo version was tested, 
but my branch was behind trunk so it didn't have the change to nodetool.

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Ariel Weisberg (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830649#comment-17830649 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

So which build are you referring to? When I look at the trunk build that 
generated that error, the log has

Cloned https://github.com/apache/cassandra-dtest.git trunk to 
/workspace/context/cassandra-dtest; commit is 
7a82b3757c136f79b52a76fdf3e98891dfff6b41

and that SHA comes after f3ca59c: 
[https://github.com/apache/cassandra-dtest/commits/trunk/]

I'll look a little deeper and try to figure out how the executors ended up 
running with the wrong commit on the dtest repo, because clearly it happened.

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Stefan Miklosovic (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830611#comment-17830611 ]

Stefan Miklosovic commented on CASSANDRA-19332:
-----------------------------------------------

j17 precommit tests (minus upgrade): 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4080/workflows/677745ba-4520-4e1d-951f-1829004d3235

From your test results for trunk I read:

{code}
ccmlib.node.ToolError: Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', 
'cms', 'initialize'] exited with non-zero status; exit status: 1; stdout: 
nodetool: Found unexpected parameters: [cms, initialize] See 'nodetool help' or 
'nodetool help '. ; stderr: Picked up JAVA_TOOL_OPTIONS: 
-XX:ErrorFile=/parallel-ci/output/hs_err_pid_%p.log
{code}

This is caused by running old dtests; you probably haven't pulled the 
cassandra-dtest repo:

https://github.com/apache/cassandra-dtest/commit/f3ca59c76dc946e75b3b9dbc90f22dc51bf1c73a

As I read it from the logs you uploaded, all dtest failures for trunk are 
about this.

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Ariel Weisberg (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830606#comment-17830606 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

Sounds good, I will wait for that.

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Stefan Miklosovic (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830602#comment-17830602 ]

Stefan Miklosovic commented on CASSANDRA-19332:
-----------------------------------------------

Yeah, that makes sense. I can run a CI for trunk in Circle for you to see 
whether your CI was just flaky or not ... give me some time here please.

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Ariel Weisberg (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830574#comment-17830574 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

Yeah, the Dropwizard fix makes this unnecessary by fixing the cost of marking 
a meter that hasn't been used in a long time. We can remove this once that is 
fixed. Since we don't update dependencies in older versions, we will need this 
in 4.0/4.1, and in 5.0 if it releases before Dropwizard does.

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Stefan Miklosovic (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830554#comment-17830554 ]

Stefan Miklosovic commented on CASSANDRA-19332:
-----------------------------------------------

Too bad the PR for Dropwizard has not been merged and released yet. Do I 
understand it correctly that we do not need Dropwizard's PR to be merged 
because we work around it here, but this change will no longer be necessary 
once that PR is merged and we upgrade the library?

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-03-25 Thread Ariel Weisberg (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830549#comment-17830549 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

I haven't merged this for a while because I have had a steady stream of CI 
issues. Based on comparisons with nightlies it doesn't look like the test 
failures are related: 5.0 is basically clean, while 4.0/4.1 and trunk have a 
bunch of upgrade test issues. I think this is more specific to our CI 
environment and some issues invoking different upgrade paths.

I'm going to go ahead and merge anyway unless someone objects, since the scope 
of this change is pretty narrow and doesn't look related to any of the failures.

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-01-31 Thread Ariel Weisberg (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812776#comment-17812776 ]

Ariel Weisberg commented on CASSANDRA-19332:
--------------------------------------------

PR to fix the issue in Dropwizard 
[https://github.com/dropwizard/metrics/pull/3929]

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-01-26 Thread Maxim Muzafarov (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811415#comment-17811415 ]

Maxim Muzafarov commented on CASSANDRA-19332:
---------------------------------------------

Should we also create an issue for Dropwizard Metrics?
https://github.com/dropwizard/metrics/issues

[jira] [Commented] (CASSANDRA-19332) Dropwizard Meter causes timeouts when infrequently used

2024-01-26 Thread Stefan Miklosovic (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811399#comment-17811399 ]

Stefan Miklosovic commented on CASSANDRA-19332:
-----------------------------------------------

cc [~mmuzaf] 
