[
https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843339#comment-16843339
]
Sumanth Pasupuleti edited comment on CASSANDRA-15013 at 5/19/19 7:19 AM:
-------------------------------------------------------------------------
Thanks for the additional feedback. I've made the following further changes to the
[patch|https://github.com/sumanth-pasupuleti/cassandra/commit/98126f5d887228f5e88eca66f007873b52a0aacf]:
* Removed the {{requestsProcessed}} metric.
* Added a {{BackpressureDeployed}} metric to indicate how many times the server
attempted to apply backpressure.
* {{EndpointPayloadTracker}} manages {{requestPayloadInFlightPerEndpoint}}; an
external consumer of {{EndpointPayloadTracker}} calls a static {{get}} that
internally does {{tryRef}}.
* {{EndpointAndGlobal.release}} returns an {{ABOVE_LIMIT}} or {{BELOW_LIMIT}}
{{Outcome}}.
[Passing
Tests|https://circleci.com/workflow-run/561b655c-c3d8-4ea6-9997-ed24754ad133]
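To make the tryRef/release accounting above concrete, here is a minimal sketch of a per-endpoint payload tracker. This is illustrative only: the class, field, and limit names mirror the ones mentioned in this comment but are stand-ins, not the actual patch code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the acquire/release flow described above.
// ENDPOINT_LIMIT and all names are illustrative, not the patch's real values.
class PayloadTrackerSketch
{
    enum Outcome { SUCCESS, ABOVE_LIMIT, BELOW_LIMIT }

    static final long ENDPOINT_LIMIT = 1024; // bytes, illustrative
    static final AtomicLong globalInFlight = new AtomicLong();
    static final ConcurrentHashMap<String, AtomicLong> perEndpoint = new ConcurrentHashMap<>();

    // Acquire: account the request payload against the endpoint's in-flight total.
    static Outcome tryRef(String endpoint, long payloadBytes)
    {
        AtomicLong counter = perEndpoint.computeIfAbsent(endpoint, e -> new AtomicLong());
        long now = counter.addAndGet(payloadBytes);
        globalInFlight.addAndGet(payloadBytes);
        return now > ENDPOINT_LIMIT ? Outcome.ABOVE_LIMIT : Outcome.SUCCESS;
    }

    // Release: report whether the endpoint's in-flight total dropped back below its limit.
    static Outcome release(String endpoint, long payloadBytes)
    {
        long now = perEndpoint.get(endpoint).addAndGet(-payloadBytes);
        globalInFlight.addAndGet(-payloadBytes);
        return now > ENDPOINT_LIMIT ? Outcome.ABOVE_LIMIT : Outcome.BELOW_LIMIT;
    }
}
```

The {{BELOW_LIMIT}} outcome on release is what lets a caller know it is safe to resume reads after backpressure was applied.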
{{EndpointPayloadTracker}} is the only accessor of
{{globalRequestPayloadInFlight}}; however, it did not seem to make semantic
sense to move {{globalRequestPayloadInFlight}} into
{{EndpointPayloadTracker}} as a member.
Regarding starting/stopping all channels for an endpoint at once, I agree it
would make the system respect the limits better than letting each channel
cross the limit once (in backpressure mode). However, since each channel's
connection properties may differ ({{THROW_ON_OVERLOAD}} vs {{BACKPRESSURE}}), we
would then have to check the property on each channel and make a decision
accordingly. I am inclined to punt on this for now. Let me know what you think.
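The per-channel decision in question might look roughly like the sketch below. {{ChannelHandle}}, {{OverloadMode}}, and {{onOverload}} are hypothetical stand-ins for the Netty channel config and the patch's option handling, not real APIs.

```java
// Illustrative only: how the per-channel THROW_ON_OVERLOAD vs BACKPRESSURE
// decision described above might be structured. All names are stand-ins.
class OverloadPolicySketch
{
    enum OverloadMode { THROW_ON_OVERLOAD, BACKPRESSURE }

    interface ChannelHandle
    {
        OverloadMode mode();
        void setAutoRead(boolean autoRead); // stand-in for netty's channel config
    }

    // Returns true if the caller should reject the request (e.g. with an
    // OverloadedException); otherwise pauses reads on the channel and returns false.
    static boolean onOverload(ChannelHandle channel)
    {
        if (channel.mode() == OverloadMode.THROW_ON_OVERLOAD)
            return true;              // caller surfaces the overload to the client
        channel.setAutoRead(false);   // backpressure: stop reading until payload drains
        return false;
    }
}
```

Stopping all of an endpoint's channels at once would mean running this check per channel, which is the extra complexity being punted on here.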
Regarding a per-client limit, I see it as a good logical extension to this patch,
and I would like to tackle it in a separate ticket. Also, as an additional
ticket, I would like to explore extending this concurrency-limiting patch
beyond the current parameter of incoming payload, which works well for
writes but not necessarily for reads, perhaps by considering the response
payload and/or the number of concurrent tasks in flight.
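For the tasks-in-flight variant floated above, a minimal semaphore-based limiter could look like this. It is a sketch under assumed semantics (non-blocking acquire, caller sheds or backpressures on failure); the class name and limit are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Hedged sketch: limiting by the number of concurrent in-flight tasks rather
// than by payload bytes. Names and the limit value are illustrative.
class TaskLimiterSketch
{
    private final Semaphore permits;

    TaskLimiterSketch(int maxInFlight)
    {
        permits = new Semaphore(maxInFlight);
    }

    // Non-blocking acquire: false means the caller should shed load or backpressure.
    boolean tryStart()
    {
        return permits.tryAcquire();
    }

    // Must be called once per successful tryStart(), e.g. in a finally block.
    void finish()
    {
        permits.release();
    }
}
```

Unlike a byte-based limit, this bounds read-heavy workloads whose requests are small but whose responses and server-side work are large.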
> Message Flusher queue can grow unbounded, potentially running JVM out of
> memory
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-15013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Sumanth Pasupuleti
> Assignee: Sumanth Pasupuleti
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0, 3.0.x, 3.11.x
>
> Attachments: BlockedEpollEventLoopFromHeapDump.png,
> BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap
> dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue
> bounded, since, in the current state, items get added to the queue without
> any check on queue size and without consulting the netty outbound buffer's
> isWritable state.
> We are seeing this issue hit our production 3.0 clusters quite often.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]