[
https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822372#comment-16822372
]
Sumanth Pasupuleti edited comment on CASSANDRA-15013 at 4/20/19 6:05 AM:
-------------------------------------------------------------------------
Updated patch: [https://github.com/apache/cassandra/pull/313]
Passing UTs and DTests:
https://circleci.com/workflow-run/31dabaa6-eab8-4f00-a711-f1b210bf7578
Thanks [~benedict]. I learnt from your suggestion that the {{Ref}} class is useful for
getting around the race conditions I was initially worried about when evicting an
endpoint from the map.
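To illustrate the eviction race being avoided, here is a minimal reference-counted sketch (this does not use Cassandra's actual {{Ref}} class; {{TrackerRegistry}}, {{acquire}} and {{release}} are hypothetical names): an endpoint's tracker is removed from the map only when the last channel releases it, and both operations go through {{compute()}} so a concurrent acquire cannot resurrect an entry mid-eviction.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of race-free eviction, not the actual patch.
// Each channel acquires a reference on its endpoint's tracker; the
// tracker is evicted from the map only on the last release.
public final class TrackerRegistry
{
    static final class Tracker
    {
        final String endpoint;
        int refs; // mutated only inside compute(), under the map's per-key lock
        Tracker(String endpoint) { this.endpoint = endpoint; }
    }

    private final ConcurrentHashMap<String, Tracker> trackers = new ConcurrentHashMap<>();

    public Tracker acquire(String endpoint)
    {
        return trackers.compute(endpoint, (ep, t) -> {
            if (t == null)
                t = new Tracker(ep);
            t.refs++;
            return t;
        });
    }

    public void release(String endpoint)
    {
        // Returning null from compute() removes the mapping atomically.
        trackers.compute(endpoint, (ep, t) -> {
            if (t == null)
                return null;
            return --t.refs == 0 ? null : t; // evict on last release
        });
    }

    public boolean contains(String endpoint)
    {
        return trackers.containsKey(endpoint);
    }
}
```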
The attached patch evicts endpoints along the lines of your proposal, except that I
used a new class, {{EndpointPayloadTracker}}, in place of the suggested class
({{Dispatcher}}). Mapping {{Dispatcher}} against endpoint would make it 1:1, one
Dispatcher per endpoint, whereas currently there is one Dispatcher per Channel,
and I rely on that association to store the channel-level inflight payload, which
is then used to turn off backpressure on a channel (one of the conditions I check
before calling {{setAutoRead(true)}} is that the channel-level inflight payload
has come down to zero).
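The channel-level accounting can be sketched roughly as follows (a simplified illustration, not the patch itself; the class and method names are hypothetical, and the caller is assumed to translate the return values into Netty {{channel.config().setAutoRead(...)}} calls):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: per-channel inflight payload accounting.
// Reads are paused when the threshold is exceeded and resumed only
// once the channel's inflight payload drains back to zero.
public final class ChannelInflightTracker
{
    private final AtomicLong inflightBytes = new AtomicLong();
    private final long pauseThresholdBytes;

    public ChannelInflightTracker(long pauseThresholdBytes)
    {
        this.pauseThresholdBytes = pauseThresholdBytes;
    }

    /** Record an incoming request; returns true if reads should be paused,
     *  i.e. the caller would invoke channel.config().setAutoRead(false). */
    public boolean onRequest(long payloadBytes)
    {
        return inflightBytes.addAndGet(payloadBytes) > pauseThresholdBytes;
    }

    /** Record request completion; returns true when the channel-level
     *  inflight payload has come down to zero, one of the conditions for
     *  re-enabling reads via setAutoRead(true). */
    public boolean onComplete(long payloadBytes)
    {
        return inflightBytes.addAndGet(-payloadBytes) == 0;
    }
}
```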
A few other changes I have made as part of this updated patch:
* Removed the channel-level threshold, out of concern for too many config knobs
(channel level, endpoint level, global level). So whenever the endpoint/global
thresholds are exceeded, backpressure is applied to the channel, or an
{{OverloadedException}} is thrown.
* In addition to the memory-based limit, added another tracker and limit check
based on the number of requests in flight. This guards against a situation where
there are too many incoming requests with payloads small enough to get around
the memory limit checks, but which end up blocking the event loop threads.
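The two limits above can be sketched together as follows (again a hypothetical simplification, not the patch; {{EndpointLimits}}, {{admit}} and {{release}} are invented names, and the caller decides between backpressure and {{OverloadedException}}):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of dual limits: bytes in flight plus a count of
// in-flight requests, so that many small requests cannot slip past the
// memory check and tie up event loop threads.
public final class EndpointLimits
{
    private final AtomicLong bytesInFlight = new AtomicLong();
    private final AtomicLong requestsInFlight = new AtomicLong();
    private final long maxBytes;
    private final long maxRequests;

    public EndpointLimits(long maxBytes, long maxRequests)
    {
        this.maxBytes = maxBytes;
        this.maxRequests = maxRequests;
    }

    /** Returns true if either limit is exceeded after admitting the request;
     *  the caller then applies backpressure or throws OverloadedException. */
    public boolean admit(long payloadBytes)
    {
        long bytes = bytesInFlight.addAndGet(payloadBytes);
        long count = requestsInFlight.incrementAndGet();
        return bytes > maxBytes || count > maxRequests;
    }

    /** Release the request's contribution to both counters on completion. */
    public void release(long payloadBytes)
    {
        bytesInFlight.addAndGet(-payloadBytes);
        requestsInFlight.decrementAndGet();
    }
}
```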
> Message Flusher queue can grow unbounded, potentially running JVM out of
> memory
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-15013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Sumanth Pasupuleti
> Assignee: Sumanth Pasupuleti
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0, 3.0.x, 3.11.x
>
> Attachments: BlockedEpollEventLoopFromHeapDump.png,
> BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap
> dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue
> bounded, since, in the current state, items get added to the queue without
> any checks on queue size, nor any checks on the Netty outbound buffer's
> isWritable state.
> We are seeing this issue hit our production 3.0 clusters quite often.