[PR] CASSANDRA-19457: Memory Leak of `DefaultSession` [cassandra-java-driver]

via GitHub Thu, 07 Mar 2024 14:34:56 -0800


SiyaoIsHiding opened a new pull request, #1916:
URL: https://github.com/apache/cassandra-java-driver/pull/1916


   `DefaultSession` is leaked by Micrometer gauge initialization.
   I tested with the following `application.conf`, which enables all node- and
   session-level metrics, and the memory leak is gone.
   ```
   datastax-java-driver.advanced.metrics {
     session.enabled = [
       # The number and rate of bytes sent for the entire session (exposed as a Meter).
        bytes-sent,
   
       # The number and rate of bytes received for the entire session (exposed as a Meter).
        bytes-received
   
       # The number of nodes to which the driver has at least one active connection (exposed as a
       # Gauge<Integer>).
        connected-nodes,
   
       # The throughput and latency percentiles of CQL requests (exposed as a Timer).
       #
       # This corresponds to the overall duration of the session.execute() call, including any
       # retry.
        cql-requests,
   
       # The number of CQL requests that timed out -- that is, the session.execute() call failed
       # with a DriverTimeoutException (exposed as a Counter).
        cql-client-timeouts,
   
       # The size of the driver-side cache of CQL prepared statements.
       #
       # The cache uses weak values eviction, so this represents the number of PreparedStatement
       # instances that your application has created, and is still holding a reference to. Note
       # that the returned value is approximate.
        cql-prepared-cache-size,
   
       # How long requests are being throttled (exposed as a Timer).
       #
       # This is the time between the start of the session.execute() call, and the moment when
       # the throttler allows the request to proceed.
        throttling.delay,
   
       # The size of the throttling queue (exposed as a Gauge<Integer>).
       #
       # This is the number of requests that the throttler is currently delaying in order to
       # preserve its SLA. This metric only works with the built-in concurrency- and rate-based
       # throttlers; in other cases, it will always be 0.
        throttling.queue-size,
   
       # The number of times a request was rejected with a RequestThrottlingException (exposed as
       # a Counter).
        throttling.errors,
   
       # The throughput and latency percentiles of DSE continuous CQL requests (exposed as a
       # Timer).
       #
       # This metric is a session-level metric and corresponds to the overall duration of the
       # session.executeContinuously() call, including any retry.
       #
       # Note that this metric is analogous to the OSS driver's 'cql-requests' metrics, but for
       # continuous paging requests only. Continuous paging requests do not update the
       # 'cql-requests' metric, because they are usually much longer. Only the following metrics
       # are updated during a continuous paging request:
       #
       # - At node level: all the usual metrics available for normal CQL requests, such as
       #   'cql-messages' and error-related metrics (but these are only updated for the first
       #   page of results);
       # - At session level: only 'continuous-cql-requests' is updated (this metric).
        continuous-cql-requests,
   
       # The throughput and latency percentiles of Graph requests (exposed as a Timer).
       #
       # This metric is a session-level metric and corresponds to the overall duration of the
       # session.execute(GraphStatement) call, including any retry.
        graph-requests,
   
       # The number of graph requests that timed out -- that is, the
       # session.execute(GraphStatement) call failed with a DriverTimeoutException (exposed as a
       # Counter).
       #
       # Note that this metric is analogous to the OSS driver's 'cql-client-timeouts' metrics, but
       # for Graph requests only.
        graph-client-timeouts
     ]
     node.enabled = [
       # The number of connections open to this node for regular requests (exposed as a
       # Gauge<Integer>).
       #
       # This includes the control connection (which uses at most one extra connection to a
       # random node in the cluster).
        pool.open-connections,
   
       # The number of stream ids available on the connections to this node (exposed as a
       # Gauge<Integer>).
       #
       # Stream ids are used to multiplex requests on each connection, so this is an indication
       # of how many more requests the node could handle concurrently before becoming saturated
       # (note that this is a driver-side only consideration, there might be other limitations on
       # the server that prevent reaching that theoretical limit).
        pool.available-streams,
   
       # The number of requests currently executing on the connections to this node (exposed as a
       # Gauge<Integer>). This includes orphaned streams.
        pool.in-flight,
   
       # The number of "orphaned" stream ids on the connections to this node (exposed as a
       # Gauge<Integer>).
       #
       # See the description of the connection.max-orphan-requests option for more details.
        pool.orphaned-streams,
   
       # The number and rate of bytes sent to this node (exposed as a Meter).
        bytes-sent,
   
       # The number and rate of bytes received from this node (exposed as a Meter).
        bytes-received,
   
       # The throughput and latency percentiles of individual CQL messages sent to this node as
       # part of an overall request (exposed as a Timer).
       #
       # Note that this does not necessarily correspond to the overall duration of the
       # session.execute() call, since the driver might query multiple nodes because of retries
       # and speculative executions. Therefore a single "request" (as seen from a client of the
       # driver) can be composed of more than one of the "messages" measured by this metric.
       #
       # Therefore this metric is intended as an insight into the performance of this particular
       # node. For statistics on overall request completion, use the session-level cql-requests.
        cql-messages,
   
       # The number of times the driver failed to send a request to this node (exposed as a
       # Counter).
       #
       # In those cases we know the request didn't even reach the coordinator, so they are retried
       # on the next node automatically (without going through the retry policy).
        errors.request.unsent,
   
       # The number of times a request was aborted before the driver even received a response
       # from this node (exposed as a Counter).
       #
       # This can happen in two cases: if the connection was closed due to an external event
       # (such as a network error or heartbeat failure); or if there was an unexpected error
       # while decoding the response (this can only be a driver bug).
        errors.request.aborted,
   
       # The number of times this node replied with a WRITE_TIMEOUT error (exposed as a Counter).
       #
       # Whether this error is rethrown directly to the client, retried or ignored is determined
       # by the RetryPolicy.
        errors.request.write-timeouts,
   
       # The number of times this node replied with a READ_TIMEOUT error (exposed as a Counter).
       #
       # Whether this error is rethrown directly to the client, retried or ignored is determined
       # by the RetryPolicy.
        errors.request.read-timeouts,
   
       # The number of times this node replied with an UNAVAILABLE error (exposed as a Counter).
       #
       # Whether this error is rethrown directly to the client, retried or ignored is determined
       # by the RetryPolicy.
        errors.request.unavailables,
   
       # The number of times this node replied with an error that doesn't fall under other
       # 'errors.*' metrics (exposed as a Counter).
        errors.request.others,
   
       # The total number of errors on this node that caused the RetryPolicy to trigger a retry
       # (exposed as a Counter).
       #
       # This is a sum of all the other retries.* metrics.
        retries.total,
   
       # The number of errors on this node that caused the RetryPolicy to trigger a retry, broken
       # down by error type (exposed as Counters).
        retries.aborted,
        retries.read-timeout,
        retries.write-timeout,
        retries.unavailable,
        retries.other,
   
       # The total number of errors on this node that were ignored by the RetryPolicy (exposed as
       # a Counter).
       #
       # This is a sum of all the other ignores.* metrics.
        ignores.total,
   
       # The number of errors on this node that were ignored by the RetryPolicy, broken down by
       # error type (exposed as Counters).
        ignores.aborted,
        ignores.read-timeout,
        ignores.write-timeout,
        ignores.unavailable,
        ignores.other,
   
       # The number of speculative executions triggered by a slow response from this node
       # (exposed as a Counter).
        speculative-executions,
   
       # The number of errors encountered while trying to establish a connection to this node
       # (exposed as a Counter).
       #
       # Connection errors are not a fatal issue for the driver, failed connections will be
       # retried periodically according to the reconnection policy. You can choose whether or not
       # to log those errors at WARN level with the connection.warn-on-init-error option.
       #
       # Authentication errors are not included in this counter, they are tracked separately in
       # errors.connection.auth.
        errors.connection.init,
   
       # The number of authentication errors encountered while trying to establish a connection
       # to this node (exposed as a Counter).
       # Authentication errors are also logged at WARN level.
        errors.connection.auth,
   
       # The throughput and latency percentiles of individual graph messages sent to this node as
       # part of an overall request (exposed as a Timer).
       #
       # Note that this does not necessarily correspond to the overall duration of the
       # session.execute() call, since the driver might query multiple nodes because of retries
       # and speculative executions. Therefore a single "request" (as seen from a client of the
       # driver) can be composed of more than one of the "messages" measured by this metric.
       #
       # Therefore this metric is intended as an insight into the performance of this particular
       # node. For statistics on overall request completion, use the session-level graph-requests.
        graph-messages,
     ]
     factory.class = MicrometerMetricsFactory
   }
   ```
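
   Since the leak is attributed to gauge initialization, a smaller configuration
   that enables only the metrics described above as `Gauge<Integer>` may be enough
   to exercise the same code path. This is an untested sketch, not the
   configuration used for this PR; whether it reproduces the leak is an assumption.
   ```
   datastax-java-driver.advanced.metrics {
     # Assumption: only the metrics documented above as Gauge<Integer> are enabled.
     session.enabled = [
       connected-nodes,
       throttling.queue-size
     ]
     node.enabled = [
       pool.open-connections,
       pool.available-streams,
       pool.in-flight,
       pool.orphaned-streams
     ]
     factory.class = MicrometerMetricsFactory
   }
   ```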


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

