[
https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512166#comment-14512166
]
Michael Kjellman edited comment on CASSANDRA-8789 at 4/25/15 1:41 AM:
----------------------------------------------------------------------
I just tried the following.
Check out 8896a70b015102c212d0a27ed1f4e1f0fabe85c4 (on which I'm able to insert
all 100k records without issue) and then apply
828496492c51d7437b690999205ecc941f41a0a9 and
144644bbf77a546c45db384e2dbc18e13f65c9ce.
With both applied, I started seeing failures about 1/3 of the way through
stress, with messages like the following in the logs:
h4. ccm node1 showlog
{noformat}
WARN [GossipTasks:1] 2015-04-24 18:32:16,832 Gossiper.java:685 - Gossip stage
has 3 pending tasks; skipping status check (no nodes will be marked down)
INFO [GossipTasks:1] 2015-04-24 18:32:40,995 Gossiper.java:938 - InetAddress
/127.0.0.1 is now DOWN
{noformat}
h4. ccm node2 showlog
{noformat}
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:42,002
OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:47,004
OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:47,004
OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:52,005
OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:52,010
OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:57,010
OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:57,011
OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:02,012
OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:02,022
OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:07,023
OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
{noformat}
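To make the timing in the node2 log easier to read, here is a minimal sketch that computes the gap between each "Handshaking version" line and the "Cannot handshake version" line that follows it. The class and method names are hypothetical, the timestamps are copied from the log above, and reading the gap as a fixed handshake wait is an inference from the timestamps rather than something the log states.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for reading the log excerpt; not part of Cassandra.
final class HandshakeTimingSketch {
    // Matches the timestamp layout used in the log lines, e.g. 2015-04-24 18:32:42,002
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Milliseconds elapsed between two log timestamps.
    static long millisBetween(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime to = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        // Each pair is {"Handshaking version" time, "Cannot handshake version" time}.
        String[][] attempts = {
            {"2015-04-24 18:32:42,002", "2015-04-24 18:32:47,004"},
            {"2015-04-24 18:32:47,004", "2015-04-24 18:32:52,005"},
            {"2015-04-24 18:32:52,010", "2015-04-24 18:32:57,010"},
            {"2015-04-24 18:32:57,011", "2015-04-24 18:33:02,012"},
            {"2015-04-24 18:33:02,022", "2015-04-24 18:33:07,023"},
        };
        for (String[] attempt : attempts) {
            System.out.println(attempt[0] + " -> " + millisBetween(attempt[0], attempt[1]) + " ms");
        }
    }
}
```

Every attempt in the excerpt fails roughly five seconds after it starts, and the next attempt begins almost immediately afterwards.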
So, in summary: with the changes to
OutboundTcpConnection/OutboundTcpConnectionPool
(828496492c51d7437b690999205ecc941f41a0a9/144644bbf77a546c45db384e2dbc18e13f65c9ce)
applied on top of 8896a70b015102c212d0a27ed1f4e1f0fabe85c4, I am able to cause
Gossiper/FD to mark nodes DOWN and have 2.0 stress fail. Without those changes
(as detailed in a previous comment on this ticket), I was able to run
cassandra-stress -l 3 against 8896a70b015102c212d0a27ed1f4e1f0fabe85c4 without
failure.
> OutboundTcpConnectionPool should route messages to sockets by size not type
> ---------------------------------------------------------------------------
>
> Key: CASSANDRA-8789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8789
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
> Fix For: 3.0
>
> Attachments: 8789.diff
>
>
> I was looking at this trying to understand what messages flow over which
> connection.
> For reads, the request goes out over the command connection and the response
> comes back over the ack connection.
> For writes, the request goes out over the command connection and the response
> comes back over the command connection.
> Reads get a dedicated socket for responses. Mutation commands and responses
> both travel over the same socket, along with read requests.
> Sockets are used unidirectionally, so there are actually four sockets in play
> and four threads at each node (2 inbound, 2 outbound).
> CASSANDRA-488 doesn't leave a record of what the impact of this change was.
> If someone remembers what situations were made better it would be good to
> know.
> I am not clear on when/how this is helpful. The consumer side shouldn't be
> blocking, so the only head-of-line blocking issue is the time it takes to
> transfer data over the wire.
> If message size is the cause of blocking issues, then the current design mixes
> small messages and large messages on the same connection, retaining the
> head-of-line blocking.
> Read requests share the same connection as write requests (which are large),
> and write acknowledgments (which are small) share the same connection as
> write requests. The only winner is read acknowledgments.
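
The contrast drawn in the description above can be sketched as follows. This is only an illustration under stated assumptions: the enum names, the method names, and the 64 KiB threshold are hypothetical and are not taken from Cassandra's OutboundTcpConnectionPool or from the attached patch. The first method mirrors the by-type routing as the description summarises it; the second shows what routing by serialized size might look like.

```java
// Hypothetical sketch; none of these names come from the Cassandra code base.
final class ConnectionRoutingSketch {
    enum Kind { READ_REQUEST, READ_RESPONSE, WRITE_REQUEST, WRITE_RESPONSE }
    enum Connection { COMMAND, ACK, SMALL, LARGE }

    // Assumed cutoff between "small" and "large" messages, for illustration only.
    static final int LARGE_THRESHOLD_BYTES = 64 * 1024;

    // Routing by type, as summarised in the description: only read responses
    // use the ack connection; everything else shares the command connection.
    static Connection routeByType(Kind kind) {
        return kind == Kind.READ_RESPONSE ? Connection.ACK : Connection.COMMAND;
    }

    // Routing by size: small messages never queue behind large ones,
    // whatever their type.
    static Connection routeBySize(int serializedSizeBytes) {
        return serializedSizeBytes > LARGE_THRESHOLD_BYTES ? Connection.LARGE : Connection.SMALL;
    }

    public static void main(String[] args) {
        // By type, a small write acknowledgment shares a connection with large writes.
        System.out.println("write response by type: " + routeByType(Kind.WRITE_RESPONSE));
        System.out.println("write request by type:  " + routeByType(Kind.WRITE_REQUEST));
        // By size, the two are separated (the sizes here are made-up examples).
        System.out.println("100-byte ack by size:   " + routeBySize(100));
        System.out.println("1 MiB write by size:    " + routeBySize(1024 * 1024));
    }
}
```

With by-type routing, a small write acknowledgment and a large write request land on the same connection, which is the head-of-line blocking concern raised above. With by-size routing they are separated regardless of message type.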