[
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Petrov updated CASSANDRA-19215:
------------------------------------
Attachment: ci_summary.html
result_details.tar.gz
> "Query start time" in native transport request threads should be the task
> enqueue time
> --------------------------------------------------------------------------------------
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Runtian Liu
> Assignee: Alex Petrov
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in
> expensive traffic from the application side. This surge involved a large
> volume of costly read queries, which took a considerable amount of time to
> process on the server side. The client had timeouts configured, and a request
> that timed out could cause the client to send new requests. Since the server
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks
> queued in the Native-Transport-Request pending queue. I expected that once
> the application ceased sending requests, the server node would quickly return
> to normal, as most requests in the queue were over half an hour old and
> should have timed out rapidly, clearing the queue. However, it actually took
> an hour to clear the native transport's pending queue, even with native
> transport disabled. Upon examining the code, I noticed that for read/write
> requests, the
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
> which determines whether a request has timed out, is only set when the task
> starts processing. This means that time spent pending in the queue does not
> count toward the timeout, no matter how long it is. I believe this is incorrect.
> The timer should start when the Cassandra server receives the request or when
> it enqueues the task, not when the request/task begins processing. This way,
> an overloaded node with many pending tasks can quickly discard timed-out
> requests and recover from an outage once new requests stop.
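
A minimal sketch of the difference described above, written in Java. It is not Cassandra's Dispatcher code: the class, the QueuedRequest type, and the isExpired, countDiscarded, and sampleQueue helpers are all hypothetical names, and the 5-second timeout and 30-minute backlog are illustrative values. It only shows how the choice of timer start changes how many queued requests can be discarded without executing them.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

// Illustrative only: none of these names come from Cassandra's Dispatcher.
public class EnqueueTimeTimeoutSketch {

    // Hypothetical stand-in for a task in the Native-Transport-Request queue.
    static final class QueuedRequest {
        final long enqueuedAtNanos;

        QueuedRequest(long enqueuedAtNanos) {
            this.enqueuedAtNanos = enqueuedAtNanos;
        }
    }

    // A request is expired once more than timeoutNanos have elapsed since startNanos.
    static boolean isExpired(long startNanos, long nowNanos, long timeoutNanos) {
        return nowNanos - startNanos > timeoutNanos;
    }

    // Counts how many queued requests would be discarded at dequeue time.
    // startAtEnqueue = false models the reported behaviour (timer starts when
    // processing starts); true models the proposal (timer starts at enqueue).
    static int countDiscarded(Queue<QueuedRequest> queue, long nowNanos,
                              long timeoutNanos, boolean startAtEnqueue) {
        int discarded = 0;
        for (QueuedRequest request : queue) {
            long start = startAtEnqueue ? request.enqueuedAtNanos : nowNanos;
            if (isExpired(start, nowNanos, timeoutNanos)) {
                discarded++;
            }
        }
        return discarded;
    }

    // Three requests: enqueued 30 minutes ago, 20 minutes ago, and 1 second ago.
    static Queue<QueuedRequest> sampleQueue(long nowNanos) {
        Queue<QueuedRequest> queue = new ArrayDeque<>();
        queue.add(new QueuedRequest(0L));
        queue.add(new QueuedRequest(TimeUnit.MINUTES.toNanos(10)));
        queue.add(new QueuedRequest(nowNanos - TimeUnit.SECONDS.toNanos(1)));
        return queue;
    }

    public static void main(String[] args) {
        long now = TimeUnit.MINUTES.toNanos(30);     // assumed "current" time
        long timeout = TimeUnit.SECONDS.toNanos(5);  // assumed request timeout
        Queue<QueuedRequest> queue = sampleQueue(now);
        System.out.println("discarded, timer starts at dequeue: "
                + countDiscarded(queue, now, timeout, false));
        System.out.println("discarded, timer starts at enqueue: "
                + countDiscarded(queue, now, timeout, true));
    }
}
```

With the timer started at dequeue, no queued request is ever already expired when a thread picks it up, so every stale request still has to be processed. With the timer started at enqueue, the two old requests in the sample are dropped immediately and only the recent one is executed.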