[
https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842121#comment-16842121
]
Benedict edited comment on CASSANDRA-15013 at 5/17/19 12:58 PM:
----------------------------------------------------------------
Thanks [~sumanth.pasupuleti], the patch is looking really good. Some remaining
questions:
* Do we need the {{requestsProcessed}} metric? We already have
{{regularStatementsExecuted}} and {{preparedStatementsExecuted}}, which should
track it closely for the traffic we care about.
* Conversely, do we want some metric to track back pressure being applied?
It’s not clear exactly what semantics we would want to maintain here, since we
don’t _currently_ pause all channels for a given endpoint when the endpoint
overflows, and it’s also unclear whether we would want to track this per-client
(probably not, although it would be really nice to do so).
* I think it would be nice to manage {{requestPayloadInFlightPerEndpoint}}
entirely inside {{EndpointPayloadTracker}}; it's presently accessed only once
outside, in an adjacent class, so it would be very simple to hide the map
entirely, as well as {{tryRef}}, and simply offer a {{public static get}}
method in {{EndpointPayloadTracker}}. WDYT?
* It might also be nice to introduce a new version of
{{EndpointAndGlobal.release}} that informs the caller if we are presently above
or below the limits. This would simplify the re-activation of a channel.
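To make the last two bullets concrete, here is a rough sketch of the shape I have in mind. The class and method names echo the patch, but the bodies, the limit value, and the {{Outcome}} enum are purely illustrative, not the actual implementation:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only -- the real EndpointPayloadTracker will differ.
// It shows (1) hiding the per-endpoint map behind a public static get(), and
// (2) a release() that reports whether we fell back under the limit, so the
// caller can decide whether to re-activate a channel.
final class EndpointPayloadTracker
{
    // hypothetical limit; the real value would come from configuration
    static final long LIMIT_BYTES = 10_000_000;

    // the map is private now: callers can no longer touch it directly
    private static final ConcurrentHashMap<String, EndpointPayloadTracker> trackers =
        new ConcurrentHashMap<>();

    private final AtomicLong inFlightBytes = new AtomicLong();

    private EndpointPayloadTracker(String endpoint) {}

    // the single public entry point, replacing external access to the map
    public static EndpointPayloadTracker get(String endpoint)
    {
        return trackers.computeIfAbsent(endpoint, EndpointPayloadTracker::new);
    }

    public long acquire(long bytes)
    {
        return inFlightBytes.addAndGet(bytes);
    }

    /** Outcome of a release, so the caller knows whether to resume reads. */
    enum Outcome { ABOVE_LIMIT, BELOW_LIMIT }

    // release() now informs the caller whether we are back under the limit
    public Outcome release(long bytes)
    {
        long now = inFlightBytes.addAndGet(-bytes);
        return now < LIMIT_BYTES ? Outcome.BELOW_LIMIT : Outcome.ABOVE_LIMIT;
    }
}
```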
What do you also think about starting/stopping all channels for an endpoint at
once when we cross the threshold? I don't think it is essential, but it is
probably worth considering, since the current behaviour makes our limits even
less clearly defined (we're already permitted to cross them once per channel;
it would be nice to tighten that to once per endpoint).
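The per-endpoint pause could look roughly like the following. This is a hedged sketch under stated assumptions: {{Channel}} here is a hypothetical stand-in for Netty's channel (toggling auto-read), {{LIMIT_BYTES}} is an invented constant, and synchronization around the pause flag is deliberately simplified:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of pausing/resuming *all* channels for an endpoint together when the
// in-flight limit is crossed, rather than only the channel that overflowed.
// Races between the counter and the paused flag are elided for brevity.
final class EndpointChannels
{
    static final long LIMIT_BYTES = 10_000_000; // hypothetical limit

    // stand-in for Netty's channel auto-read toggle; not a real API
    interface Channel { void setAutoRead(boolean autoRead); }

    private final Set<Channel> channels = ConcurrentHashMap.newKeySet();
    private final AtomicLong inFlightBytes = new AtomicLong();
    private volatile boolean paused;

    void register(Channel c)   { channels.add(c); }
    void unregister(Channel c) { channels.remove(c); }

    // crossing the limit pauses every channel for this endpoint at once
    void acquire(long bytes)
    {
        if (inFlightBytes.addAndGet(bytes) >= LIMIT_BYTES && !paused)
        {
            paused = true;
            channels.forEach(c -> c.setAutoRead(false));
        }
    }

    // dropping back below the limit resumes them all at once
    void release(long bytes)
    {
        if (inFlightBytes.addAndGet(-bytes) < LIMIT_BYTES && paused)
        {
            paused = false;
            channels.forEach(c -> c.setAutoRead(true));
        }
    }
}
```

This would tighten the overshoot to once per endpoint, at the cost of briefly stalling channels that were individually under budget.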
> Message Flusher queue can grow unbounded, potentially running JVM out of
> memory
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-15013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Sumanth Pasupuleti
> Assignee: Sumanth Pasupuleti
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0, 3.0.x, 3.11.x
>
> Attachments: BlockedEpollEventLoopFromHeapDump.png,
> BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap
> dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue
> bounded, since, in its current state, items get added to the queue without
> any check on queue size, and without consulting the Netty outbound buffer's
> isWritable state.
> We are seeing this issue hit our production 3.0 clusters quite often.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)