[jira] [Comment Edited] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

Christian Esken (JIRA) Thu, 13 Apr 2017 04:59:55 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967484#comment-15967484
 ]


Christian Esken edited comment on CASSANDRA-13265 at 4/13/17 11:59 AM:
-----------------------------------------------------------------------

The build failed for several reasons. For example, Eclipse somehow did not 
pick up the build parameters for 2.2 after "ant generate-eclipse-files", so the 
build was done with Java 8 language level (lambdas). Building and testing in 
Eclipse alone is apparently not enough, so I redid everything manually in the 
console and fixed the issues. As you recommended, I have created branches that 
follow your naming (cassandra-13625-3.0) with squashed commits. The new 
branches are:

https://github.com/christian-esken/cassandra/commits/cassandra-13625-2.2
https://github.com/christian-esken/cassandra/commits/cassandra-13625-3.11
https://github.com/christian-esken/cassandra/commits/cassandra-13625-3.0
https://github.com/christian-esken/cassandra/commits/cassandra-13625-trunk

About CHANGES.TXT: In each branch I added an entry under the "matching" release 
version listed there. Please check, as the naming conventions within Cassandra 
are still not clear to me (e.g. there exists a 3.11 branch, a 3.0.11 release 
and a 3.11.0 changelog entry).


was (Author: cesken):
The build failed for several reasons. For example, Eclipse somehow did not 
pick up the build parameters for 2.2 after "ant generate-eclipse-files", so the 
build was done with Java 8 language level (lambdas). Building and testing in 
Eclipse alone is apparently not enough, so I redid everything manually in the 
console and fixed the issues. As you recommended, I have created branches that 
follow your naming (cassandra-13625-3.0) with squashed commits. The new 
branches are:

https://github.com/christian-esken/cassandra/commits/cassandra-13625-2.2
https://github.com/christian-esken/cassandra/commits/cassandra-13625-3.11
https://github.com/christian-esken/cassandra/commits/cassandra-13625-3.0
https://github.com/christian-esken/cassandra/commits/cassandra-13625-trunk

About CHANGES.TXT: I added changes to all branches in the appropriate 
versions. Please check, as the naming conventions within Cassandra are still 
not clear to me (e.g. there exists a 3.11 branch, a 3.0.11 release and a 3.11.0 
changelog entry).

> Expiration in OutboundTcpConnection can block the reader Thread
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-13265
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>            Reporter: Christian Esken
>            Assignee: Christian Esken
>             Fix For: 3.0.x
>
>         Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate with the other nodes. This can happen at any time, during peak 
> load or low load. Restarting that single node fixes the issue.
> Before going into details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A thread dump in this situation showed 324 threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect? As soon as the Cassandra node has reached a certain 
> number of queued messages, it starts thrashing itself to death. Each of the 
> threads fully locks the queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operations can a thread progress with 
> actually writing to the queue.
> - Reading: Is also blocked, as 324 threads try to do iterator.next() and 
> fully lock the queue.
> This means: Writing blocks the queue for reading, and readers might even be 
> starved, which makes the situation even worse.
> -----
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DCs
>  - high write throughput (100000 INSERT statements per second and more during 
> peak times).
>  
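
The contention described in the issue can be illustrated with a small Java 
sketch. It is not the patch from the branches linked above, only one plausible 
way to bound the cost: at most one thread walks the backlog at a time, and no 
more often than a fixed interval. The class name BacklogExpirationSketch, the 
simplified QueuedMessage, the fields expirationActive and nextExpirationNanos, 
and the interval value are all assumptions made for illustration. The clock is 
passed in explicitly so that the behaviour is deterministic.

```java
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class BacklogExpirationSketch {
    /** Hypothetical, simplified stand-in for OutboundTcpConnection$QueuedMessage. */
    static final class QueuedMessage {
        final long expiresAtNanos;
        QueuedMessage(long expiresAtNanos) { this.expiresAtNanos = expiresAtNanos; }
        boolean isTimedOut(long nowNanos) { return nowNanos - expiresAtNanos >= 0; }
    }

    private final LinkedBlockingQueue<QueuedMessage> backlog = new LinkedBlockingQueue<>();
    // Guard so that only one thread iterates the backlog at any time (assumption).
    private final AtomicBoolean expirationActive = new AtomicBoolean(false);
    // Earliest time at which the next expiration pass may run (assumption).
    private volatile long nextExpirationNanos;
    private final long intervalNanos;

    BacklogExpirationSketch(long intervalNanos, long startNanos) {
        this.intervalNanos = intervalNanos;
        this.nextExpirationNanos = startNanos;
    }

    /** Every writer attempts expiration, but most calls return immediately. */
    void enqueue(QueuedMessage message, long nowNanos) {
        expireMessages(nowNanos);
        backlog.add(message);
    }

    /** Removes timed-out messages and returns how many were removed. */
    int expireMessages(long nowNanos) {
        if (nowNanos - nextExpirationNanos < 0)
            return 0; // a pass ran recently; do not touch the queue locks
        if (!expirationActive.compareAndSet(false, true))
            return 0; // another thread is already expiring
        try {
            int removed = 0;
            // LinkedBlockingQueue's iterator takes both the put lock and the
            // take lock in next() and remove(), so this loop is the costly part.
            Iterator<QueuedMessage> it = backlog.iterator();
            while (it.hasNext()) {
                if (it.next().isTimedOut(nowNanos)) {
                    it.remove();
                    removed++;
                }
            }
            nextExpirationNanos = nowNanos + intervalNanos;
            return removed;
        } finally {
            expirationActive.set(false);
        }
    }

    int size() { return backlog.size(); }

    public static void main(String[] args) {
        BacklogExpirationSketch sketch = new BacklogExpirationSketch(1000, 0);
        sketch.enqueue(new QueuedMessage(10), 0);
        sketch.enqueue(new QueuedMessage(20), 1);
        sketch.enqueue(new QueuedMessage(5000), 2);
        System.out.println("removed too early: " + sketch.expireMessages(500));
        System.out.println("removed after interval: " + sketch.expireMessages(1000));
        System.out.println("remaining: " + sketch.size());
    }
}
```

Without the two early returns, every enqueueing thread would iterate the whole 
backlog, which matches the pattern in the thread dump: hundreds of threads 
competing for the same queue locks while the reader is kept waiting.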



