[
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883446#comment-15883446
]
Ariel Weisberg commented on CASSANDRA-13265:
--------------------------------------------
Thanks for catching this. I am not clear on you end up with so many threads
blocked. There shouldn't be that many threads since only 128 request threads
can be in flight in a default configuration. What are these threads and what
pool are they coming from?
This fix creates a thread to do the scheduled work per outbound TCP connection
and unfortunately we have a lot of them. For a typical thousand node cluster we
will create 6k threads (3 inbound, 3 outbound) and this would add another 3k.
I think the ideal fix would commit a bounded number of threads to processing
expiration work. It would also only attempt to expire on a fixed interval so
excessive time isn't spent checking for expirations that won't occur because no
time has passed.
Initially I was going to suggest another thread pool based approach to handle
expiration, but I don't see much to be gained doing expiration in a separate
thread. Certainly it would reduce outliers and be more non-blocking. You could
do the expiration work in the thread submitting a message and just take care
that only one thread does expiration work every X milliseconds/seconds.
So the general requirements as I see it whether you use another thread or not.:
If there is currently no expiration processing task for a connection and it
looks like we should expire then we should atomically perform expiration. It
would be okay to occasionally miss executing expiration since the next message
will have an opportunity to perform expiration. We should not perform
expiration if the time since the last expiration finished is < some threshold
that is configurable via cassandra.yaml and JMX (also queryable via JMX).
If we do decide to use a thread pool I think you want allocate a bounded thread
pool that is not that big. Maybe min(availableProcessors, 4). Configurable via
a property.
This also needs to be done on 2.1, 2.2, 3.0, 3.11, and trunk, but to start just
get it done in one version so we can get the design right.
> Communication breakdown in OutboundTcpConnection
> ------------------------------------------------
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
> Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version
> 1.8.0_112-b15)
> Linux 3.16
> Reporter: Christian Esken
> Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz,
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to
> communicate to the other nodes. This can happen at any time, during peak load
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the
> situation and am already developing a possible fix. Here is the analysis so
> far:
> - A Threaddump in this situation showed 324 Threads in the
> OutboundTcpConnection class that want to lock the backlog queue for doing
> expiration.
> - A class histogram shows 262508 instances of
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain
> amount of queued messages, it starts thrashing itself to death. Each of the
> Thread fully locks the Queue for reading and writing by calling
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be
> starved which makes the situation even worse.
> -----
> The setup is:
> - 3-node cluster
> - replication factor 2
> - Consistency LOCAL_ONE
> - No remote DC's
> - high write throughput (100000 INSERT statements per second and more during
> peak times).
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)