[jira] [Updated] (CASSANDRA-13265) Communication breakdown in OutboundTcpConnection
[ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Jirsa updated CASSANDRA-13265:
-----------------------------------
    Resolution: Duplicate
        Status: Resolved  (was: Awaiting Feedback)

Closing

> Communication breakdown in OutboundTcpConnection
> ------------------------------------------------
>
>                 Key: CASSANDRA-13265
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 1.8.0_112-b15)
> Linux 3.16
>            Reporter: Christian Esken
>            Assignee: Christian Esken
>         Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to
> communicate with the other nodes. This can happen at any time, during peak
> load or low load. Restarting that single node fixes the issue.
> Before going into details, I want to state that I have analyzed the
> situation and am already developing a possible fix. Here is the analysis so
> far:
> - A thread dump in this situation showed 324 threads in the
> OutboundTcpConnection class that want to lock the backlog queue for doing
> expiration.
> - A class histogram shows 262508 instances of
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain
> amount of queued messages, it starts thrashing itself to death. Each of these
> threads fully locks the queue for reading and writing by calling
> iterator.next(), making the situation worse and worse.
> - Writing: A writer can only progress with actually writing to the queue
> after 262508 locking operations.
> - Reading: Is also blocked, as 324 threads try to do iterator.next() and
> fully lock the queue.
> This means: Writing blocks the queue for reading, and readers might even be
> starved, which makes the situation even worse.
> -----
> The setup is:
> - 3-node cluster
> - replication factor 2
> - Consistency LOCAL_ONE
> - No remote DCs
> - high write throughput (10 INSERT statements per second and more during
> peak times).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
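The contention pattern described in the issue above can be illustrated with a minimal sketch. It assumes the backlog is a java.util.concurrent.LinkedBlockingQueue (whose iterator takes both the put lock and the take lock on every next() call in the OpenJDK 8 implementation) and that expiration is triggered from enqueue once the backlog exceeds a threshold. The class name BacklogExpirationSketch, the QueuedMessage fields, the threshold parameter and the AtomicBoolean gate are all hypothetical; the gate is only one conceivable mitigation and is not claimed to be the fix the reporter is developing.

```java
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for an outbound connection backlog; not Cassandra's code. */
final class BacklogExpirationSketch {

    /** Hypothetical message type that carries only an absolute expiry timestamp. */
    static final class QueuedMessage {
        final long expiresAtNanos;

        QueuedMessage(long expiresAtNanos) {
            this.expiresAtNanos = expiresAtNanos;
        }

        boolean isTimedOut(long nowNanos) {
            return nowNanos - expiresAtNanos >= 0;
        }
    }

    private final LinkedBlockingQueue<QueuedMessage> backlog = new LinkedBlockingQueue<>();
    // Assumed mitigation: at most one thread walks the queue at any time.
    // Package-private so the gate can be exercised from a test.
    final AtomicBoolean expiring = new AtomicBoolean(false);
    private final int threshold;

    BacklogExpirationSketch(int threshold) {
        this.threshold = threshold;
    }

    /** Adds a message; tries to expire old ones first if the backlog is large. */
    int enqueue(QueuedMessage message, long nowNanos) {
        int expired = 0;
        if (backlog.size() > threshold) {
            expired = expireMessages(nowNanos);
        }
        backlog.add(message);
        return expired;
    }

    /**
     * Removes timed-out messages and returns how many were removed.
     * Every it.next() locks the whole queue, so without the gate N writer
     * threads would each perform one full-lock operation per queued message.
     */
    int expireMessages(long nowNanos) {
        if (!expiring.compareAndSet(false, true)) {
            return 0; // another thread is already expiring; skip instead of piling up
        }
        try {
            int removed = 0;
            Iterator<QueuedMessage> it = backlog.iterator();
            while (it.hasNext()) {
                if (it.next().isTimedOut(nowNanos)) {
                    it.remove();
                    removed++;
                }
            }
            return removed;
        } finally {
            expiring.set(false);
        }
    }

    int size() {
        return backlog.size();
    }

    public static void main(String[] args) {
        BacklogExpirationSketch c = new BacklogExpirationSketch(2);
        c.enqueue(new QueuedMessage(10), 0);
        c.enqueue(new QueuedMessage(10), 0);
        c.enqueue(new QueuedMessage(100), 0);
        // Backlog size 3 exceeds the threshold, so this call expires the two old messages.
        int expired = c.enqueue(new QueuedMessage(100), 50);
        System.out.println("expired=" + expired + " remaining=" + c.size());
        // prints: expired=2 remaining=2
    }
}
```

With the gate removed, every thread that enters enqueue above the threshold would iterate the full backlog, which matches the reported picture of 324 threads each needing 262508 locking operations. Skipping expiration when another thread already holds the gate trades slightly later expiry for bounded lock traffic.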
[jira] [Updated] (CASSANDRA-13265) Communication breakdown in OutboundTcpConnection
[ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-13265:
---------------------------------------
    Status: Awaiting Feedback  (was: Open)
[jira] [Updated] (CASSANDRA-13265) Communication breakdown in OutboundTcpConnection
[ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-13265:
---------------------------------------
    Reviewer: Ariel Weisberg
[jira] [Updated] (CASSANDRA-13265) Communication breakdown in OutboundTcpConnection
[ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Esken updated CASSANDRA-13265:
----------------------------------------
    Attachment: cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz

Thread Dump
[jira] [Updated] (CASSANDRA-13265) Communication breakdown in OutboundTcpConnection
[ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Esken updated CASSANDRA-13265:
----------------------------------------
    Attachment: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz

Class Histogram
[jira] [Updated] (CASSANDRA-13265) Communication breakdown in OutboundTcpConnection
[ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Esken updated CASSANDRA-13265:
----------------------------------------
    Description:
I observed that sometimes a single node in a Cassandra cluster fails to communicate with the other nodes. This can happen at any time, during peak load or low load. Restarting that single node fixes the issue.

Before going into details, I want to state that I have analyzed the situation and am already developing a possible fix. Here is the analysis so far:
- A thread dump in this situation showed 324 threads in the OutboundTcpConnection class that want to lock the backlog queue for doing expiration.
- A class histogram shows 262508 instances of OutboundTcpConnection$QueuedMessage.

What is the effect of it? As soon as the Cassandra node has reached a certain amount of queued messages, it starts thrashing itself to death. Each of these threads fully locks the queue for reading and writing by calling iterator.next(), making the situation worse and worse.
- Writing: A writer can only progress with actually writing to the queue after 262508 locking operations.
- Reading: Is also blocked, as 324 threads try to do iterator.next() and fully lock the queue.

This means: Writing blocks the queue for reading, and readers might even be starved, which makes the situation even worse.

-----
The setup is:
- 3-node cluster
- replication factor 2
- Consistency LOCAL_ONE
- No remote DCs
- high write throughput (10 INSERT statements per second and more during peak times).

  was:
I observed that sometimes a single node in a Cassandra cluster fails to communicate with the other nodes. This can happen at any time, during peak load or low load. Restarting that single node fixes the issue.

Before going into details, I want to state that I have analyzed the situation and am already developing a possible fix. Here is the analysis so far:
- A thread dump in this situation showed 324 threads in the OutboundTcpConnection class that want to lock the backlog queue for doing expiration.
- A class histogram shows 262508 instances of OutboundTcpConnection$QueuedMessage.

What is the effect of it? Once the Cassandra node has reached that state, it never recovers by itself; instead it thrashes itself to death, as each of these threads fully locks the queue for reading and writing by calling iterator.next().
- Writing: A writer can only progress with actually writing to the queue after 262508 locking operations.
- Reading: Is also blocked, as 324 threads try to do iterator.next() and fully lock the queue.

This means: Writing blocks the queue for reading, and readers might even be starved, which makes the situation even worse.

-----
The setup is:
- 3-node cluster
- replication factor 2
- Consistency LOCAL_ONE
- No remote DCs
- high write throughput (10 INSERT statements per second and more during peak times).
[jira] [Updated] (CASSANDRA-13265) Communication breakdown in OutboundTcpConnection
[ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Esken updated CASSANDRA-13265:
----------------------------------------
    Description:
I observed that sometimes a single node in a Cassandra cluster fails to communicate with the other nodes. This can happen at any time, during peak load or low load. Restarting that single node fixes the issue.

Before going into details, I want to state that I have analyzed the situation and am already developing a possible fix. Here is the analysis so far:
- A thread dump in this situation showed 324 threads in the OutboundTcpConnection class that want to lock the backlog queue for doing expiration.
- A class histogram shows 262508 instances of OutboundTcpConnection$QueuedMessage.

What is the effect of it? Once the Cassandra node has reached that state, it never recovers by itself; instead it thrashes itself to death, as each of these threads fully locks the queue for reading and writing by calling iterator.next().
- Writing: A writer can only progress with actually writing to the queue after 262508 locking operations.
- Reading: Is also blocked, as 324 threads try to do iterator.next() and fully lock the queue.

This means: Writing blocks the queue for reading, and readers might even be starved, which makes the situation even worse.

-----
The setup is:
- 3-node cluster
- replication factor 2
- Consistency LOCAL_ONE
- No remote DCs
- high write throughput (10 INSERT statements per second and more during peak times).

  was:
I observed that sometimes a single node in a Cassandra cluster fails to communicate with the other nodes. This can happen at any time, during peak load or low load. Restarting that single node fixes the issue.

Before going into details, I want to state that I have analyzed the situation and am already developing a possible fix. Here is the analysis so far:
- A thread dump in this situation showed that 324 threads in the OutboundTcpConnection class wanted to lock the backlog queue for doing expiration.
- A class histogram shows 262508 instances of OutboundTcpConnection$QueuedMessage.

What is the effect of it? Once the Cassandra node has reached that state, it never recovers by itself; instead it thrashes itself to death, as each of these threads fully locks the queue for reading and writing by calling iterator.next().
- Writing: A writer can only progress with actually writing to the queue after 262508 locking operations.
- Reading: Is also blocked, as 324 threads try to do iterator.next() and fully lock the queue.

This means: Writing blocks the queue for reading, and readers might even be starved, which makes the situation even worse.

-----
The setup is:
- 3-node cluster
- replication factor 2
- Consistency LOCAL_ONE
- No remote DCs
- high write throughput (10 INSERT statements per second and more during peak times).