[
https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839326#comment-16839326
]
Benedict commented on CASSANDRA-15013:
--------------------------------------
Thanks [~sumanth.pasupuleti]. I started reviewing your new changes, and part
way through realised we could potentially simplify this all a great deal with a
slightly different approach: namely, hashing each endpoint to a specific
eventLoop when accepting the connection. If we were to do this, we could have
very simple per-thread accounting, and we could even aggregate all of the
per-endpoint channels into a single flusher for stopping/starting together once
they exceed their limits. Everything would be single threaded, so our logic
would be much simpler to reason about.
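To make the idea concrete, here is a minimal sketch of hashing an endpoint to an
event loop index and keeping a plain per-loop byte counter. It is not taken from
the patch: the class name EndpointLoopAssigner, its methods, and the choice of
SHA-256 are all assumptions for illustration. The raw address bytes would
typically come from InetAddress.getAddress().

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Cassandra: maps endpoints to event loops
// and tracks queued bytes per loop.
public final class EndpointLoopAssigner {
    private final int loopCount;
    private final long limitPerLoop;
    // One slot per event loop. In the intended design each slot is only
    // touched by its own loop thread, so no synchronisation is used here.
    private final long[] bytesQueued;

    public EndpointLoopAssigner(int loopCount, long limitPerLoop) {
        if (loopCount <= 0) throw new IllegalArgumentException("loopCount must be positive");
        this.loopCount = loopCount;
        this.limitPerLoop = limitPerLoop;
        this.bytesQueued = new long[loopCount];
    }

    // Hash the raw address bytes and reduce the first 4 digest bytes to a loop
    // index. The same endpoint always maps to the same loop.
    public int loopFor(byte[] rawAddress) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            int h = ByteBuffer.wrap(md.digest(rawAddress)).getInt();
            return Math.floorMod(h, loopCount);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Record bytes added to the loop's flusher queue.
    // Returns true while the loop is still within its limit.
    public boolean onEnqueue(int loop, long bytes) {
        bytesQueued[loop] += bytes;
        return bytesQueued[loop] <= limitPerLoop;
    }

    // Record bytes flushed to the network.
    // Returns true once the loop is back within its limit.
    public boolean onFlushed(int loop, long bytes) {
        bytesQueued[loop] -= bytes;
        return bytesQueued[loop] <= limitPerLoop;
    }
}
```

A false return from onEnqueue is where a real implementation might stop reading
from all channels on that loop, and a true return from onFlushed is where it
might resume them.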
This isn't without its tradeoffs. Users might have a setup with a single
application node speaking to the cluster, but this would be a very peculiar
system design to pair with Cassandra, and a single dedicated eventLoop for this
node would still likely suffice for a majority of workloads. We also have the
potential issue of endpoint collisions, but if we use a cryptographic hash
function this should only be a problem for a very small number of nodes (and if
we ever find it is a real problem, we can remedy it).
What do you think? I'm sorry for moving the goal posts suddenly, it just
hadn't occurred to me until now. My goal is only the best patch, so I'm
interested to hear your thoughts.
> Message Flusher queue can grow unbounded, potentially running JVM out of
> memory
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-15013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Sumanth Pasupuleti
> Assignee: Sumanth Pasupuleti
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0, 3.0.x, 3.11.x
>
> Attachments: BlockedEpollEventLoopFromHeapDump.png,
> BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap
> dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue
> bounded. In the current state, items get added to the queue without any
> check on the queue size and without checking the isWritable state of the
> netty outbound buffer.
> We are seeing this issue hit our production 3.0 clusters quite often.