[
https://issues.apache.org/jira/browse/CASSANDRA-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886529#comment-17886529
]
Dmitry Konstantinov commented on CASSANDRA-19702:
-------------------------------------------------
Please find below the analysis of the test failure, with a suggestion for how to
fix it. It took more time than I initially expected because I discovered some
issues that are not directly related to the test failure but are important in
general.
== Background
The failing test was introduced as a part of CASSANDRA-8670
([https://github.com/aweisberg/cassandra/commit/1b2e4f31871a4d0cadfb3bce0d1e38c711ecdeef])
to verify the fix for native memory leaking through JDK-level thread-local
caches for direct buffers.
The test sends a few CQL inserts with a large column value (63 MiB) and asserts
that the Cassandra server's direct byte buffer memory usage does not increase
significantly afterwards.
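For context, the shape of the test is roughly the following; this is a minimal
sketch assuming the standard DataStax Python driver API, with illustrative
keyspace/table names rather than the actual dtest code:
{code:python}
# Rough shape of the dtest (illustrative; not the actual largecolumn_test code).
from cassandra.cluster import Cluster  # DataStax Python driver

BIG_VALUE = 'x' * (63 * 1024 * 1024)   # ~63 MiB column value, as in the test

session = Cluster(['127.0.0.1']).connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE IF NOT EXISTS ks.t (pk int PRIMARY KEY, val text)")

# Send a few inserts carrying the large value...
for pk in range(5):
    session.execute("INSERT INTO ks.t (pk, val) VALUES (%s, %s)", (pk, BIG_VALUE))

# ...after which the real test asserts, on the server side, that direct byte
# buffer usage on the nodes did not grow significantly.
{code}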
Observation: [https://bugs.openjdk.org/browse/JDK-8147468] - a JDK-level fix
for the issue (the -Djdk.nio.MaxCachedBufferSize option) was implemented some
time after CASSANDRA-8670 and backported to JDK 8.
I have created CASSANDRA-19976 to consider whether it makes sense to add
-Djdk.nio.MaxCachedBufferSize to the Cassandra JVM settings by default.
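If CASSANDRA-19976 is accepted, a dtest could also exercise the option
explicitly. A hedged sketch, assuming ccm's start() accepts a jvm_args
parameter; the 1 MiB threshold is only an example value:
{code:python}
# Inside a cassandra-dtest test method (self.cluster is the ccm cluster object).
# Sketch only: jvm_args support and the 1 MiB threshold are assumptions here.
self.cluster.populate(2).start(
    jvm_args=['-Djdk.nio.MaxCachedBufferSize=1048576'])
{code}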
== Root cause analysis
I have observed 2 causes for the test failure:
1) Sometimes the test fails with an OutOfMemory error in one of the started
Cassandra server nodes. Heap dump analysis shows a kind of memory leak through
thread locals; more details are captured in CASSANDRA-19942, and the issue has
recently been fixed by CASSANDRA-19785.
2) The primary reason for the test failure is that it hits a specific corner
case of the node overload logic. The test sends a big request message. The v5
protocol is used, so the message is split into multiple frames which are
re-assembled on the server side; more precisely, the
org.apache.cassandra.transport.CQLMessageHandler#processFirstFrameOfLargeMessage
method processes the first frame of the large request. The attached diagram
shows it in more detail: [^cassandra_rate_limit.svg]
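To make the corner case easier to follow, here is a schematic paraphrase of the
decision the diagram illustrates. This is a Python sketch of the control flow
only, not the actual Java implementation; throw_on_overload models the client's
THROW_ON_OVERLOAD startup option.
{code:python}
# Python sketch of the control flow only; NOT the actual Java code in
# CQLMessageHandler#processFirstFrameOfLargeMessage.
import logging

log = logging.getLogger("native-transport-sketch")

def process_first_frame(acquired_capacity: bool, throw_on_overload: bool) -> str:
    """acquired_capacity: could the global/per-endpoint byte limits be reserved?"""
    if acquired_capacity:
        return "process"                       # normal path
    if throw_on_overload:
        return "reject (OverloadedException)"  # client opted into overload errors
    # Pathological branch hit by the test: capacity was not acquired, yet the
    # large message is still accepted and processed; since CASSANDRA-19534 an
    # ERROR is logged here, which the dtest's generic log check later flags.
    log.error("Could not aquire capacity while processing native protocol message")
    return "process anyway"
{code}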
Observations:
- the test message size is bigger than both the global and the per-endpoint
limits (see the arithmetic sketch after this list): the global limit is
controlled by the native_transport_max_request_data_in_flight config option and
defaults to Xmx / 10. The test is executed with Xmx = 495 MiB, so
native_transport_max_request_data_in_flight = 49 MiB. The same applies to the
per-endpoint limit, controlled by the
native_transport_max_request_data_in_flight_per_ip option, which defaults to
Xmx / 40 = 12 MiB. So both limits are lower than the large value sent by the
test.
- when we cannot allocate the required capacity for a big message, we still
process it, and it is not actually classified as an overload. The overload
logic was changed when v5 protocol frames were introduced and was later
adjusted further to avoid client starvation in commit
[https://github.com/beobal/cassandra/commit/0aae5b4921a3ce6c21a8a0e977624c877b19cd5b]
(a part of CASSANDRA-14973).
- the error log message "Could not aquire capacity while processing native
protocol message" was added recently as a part of the CASSANDRA-19534 commit,
but it was added to an already existing logic branch; I suppose this was done
to highlight that we are hitting a pathological case.
- the generic log-checking logic of the Python dtest framework looks for any
errors in the node logs and fails because it observes this error message.
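The arithmetic behind the first observation, spelled out:
{code:python}
# Default in-flight limits vs. the test message size (integer MiB, approximate).
XMX_MIB = 495        # heap size used for the dtest nodes
MESSAGE_MIB = 63     # large column value sent by the test

global_limit = XMX_MIB // 10   # native_transport_max_request_data_in_flight -> 49 MiB
per_ip_limit = XMX_MIB // 40   # native_transport_max_request_data_in_flight_per_ip -> 12 MiB

# Both limits are below the 63 MiB message, so capacity acquisition fails.
assert MESSAGE_MIB > global_limit and MESSAGE_MIB > per_ip_limit
{code}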
== Suggested changes
The current test was not designed to test overload use cases, so I suggest the
following fix plan:
1) fix the test itself by configuring
native_transport_max_request_data_in_flight and
native_transport_max_request_data_in_flight_per_ip so that the messages are
handled by the normal processing logic (see the sketch after this list). PR for
the change: [https://github.com/apache/cassandra-dtest/pull/268]
2) I will create a separate ticket to improve the handling of the pathological
overload branch. Currently, we do not have real protection against overloading
the Cassandra server with big CQL messages.
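A minimal sketch of the test-side change in (1), assuming the ccm
set_configuration_options API used throughout cassandra-dtest; the concrete
values and unit syntax are illustrative and not taken from the PR:
{code:python}
# Inside the dtest, before starting the cluster: raise both in-flight limits so
# the 63 MiB messages are handled by the normal processing path. The exact
# values/units are illustrative and may differ from the actual PR.
self.cluster.set_configuration_options(values={
    'native_transport_max_request_data_in_flight': '128MiB',
    'native_transport_max_request_data_in_flight_per_ip': '128MiB',
})
{code}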
> Test failure: largecolumn_test.TestLargeColumn
> ----------------------------------------------
>
> Key: CASSANDRA-19702
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19702
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Core
> Reporter: Brandon Williams
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 4.1.x
>
> Attachments: cassandra_rate_limit.svg, ci_summary.html
>
>
> https://app.circleci.com/pipelines/github/driftx/cassandra/1671/workflows/980121dd-ac7f-4747-91fa-391d961c9c5b/jobs/93813/tests
> {quote}
> failed on teardown with "Unexpected error found in node logs (see stdout for
> full details). Errors: [[node1] 'ERROR [epollEventLoopGroup-5-8] 2024-06-13
> 15:06:46,808 NoSpamLogger.java:111 - Could not aquire capacity while
> processing native protocol message', [node2] 'ERROR [epollEventLoopGroup-5-4]
> 2024-06-13 15:06:48,851 NoSpamLogger.java:111 - Could not aquire capacity
> while processing native protocol message']"
> {quote}