[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503945#comment-14503945 ]
Michael Kjellman commented on CASSANDRA-8789:
---------------------------------------------

My testing has shown that relying on message size as a heuristic for choosing which channel/socket to write to has adverse effects under load. The problem is that it mixes high-priority "Command" verbs (e.g. GOSSIP_DIGEST_SYN/GOSSIP_DIGEST_ACK), which cannot be delayed in any way given the current implementation of FailureDetector, with lower-priority "Response/Data" verbs (e.g. MUTATION/READ/REQUEST_RESPONSE). The effect is that nodes flap and are incorrectly considered DOWN because Gossip verbs are now queued behind lower-priority messages.

The implementation of MessagingService is "fire and forget"; however, for most messages we do expect some form of ACK. For instance, each MUTATION expects a REQUEST_RESPONSE within a given timeout; otherwise a hint is generated. Here lies the problem: the REQUEST_RESPONSE verb is 6 bytes (with no payload, so it is now considered "small"). We also have INTERNAL_RESPONSE (also 6 bytes). By using size instead of priority, or instead of the old hard-coded Command/Data split (high-priority messages like GOSSIP over one channel, normal/low-priority messages over another), the REQUEST_RESPONSE for each MUTATION is now sent over the same channel that used to be reserved for GOSSIP (and other high-priority Command) verbs.

If the kernel buffers back up sufficiently (although we set the NO_DELAY option on the socket, under moderate/high load it isn't very difficult to saturate the NIC), we have now moved an ACK message for every MUTATION onto the same socket that sends GOSSIP messages. Eventually, if we back up with enough small messages, we will likely be unable to send *important* messages (e.g. GOSSIP_DIGEST_SYN/GOSSIP_DIGEST_ACK) in a timely manner; the FailureDetector will then be falsely triggered and nodes will be incorrectly marked DOWN.
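To make the failure mode concrete, here is a minimal sketch of the two selection strategies. This is not Cassandra's actual OutboundTcpConnectionPool code: the class/method names and the 64KB size cutoff are illustrative assumptions, and the priority values are the ones suggested in the Alternative Proposal below.

```java
import java.util.Map;

// Illustrative sketch only: names and the 64KB threshold are assumptions,
// not Cassandra's actual OutboundTcpConnectionPool implementation.
public class ChannelSelection {
    enum Verb { MUTATION, READ_REPAIR, REQUEST_RESPONSE, INTERNAL_RESPONSE,
                COUNTER_MUTATION, GOSSIP_DIGEST_SYN, GOSSIP_DIGEST_ACK,
                GOSSIP_DIGEST_ACK2 }

    static final int SMALL_THRESHOLD = 64 * 1024; // hypothetical cutoff

    // Size-based selection (the change this comment argues against):
    // the 6-byte REQUEST_RESPONSE ACK for every MUTATION lands on the same
    // "small" channel that gossip heartbeats depend on.
    static String bySize(Verb verb, int serializedSize) {
        return serializedSize < SMALL_THRESHOLD ? "small" : "large";
    }

    // Priority-based selection, using the priority values from the proposal:
    // priority-1 verbs go to the high-priority channel, the rest to normal.
    static final Map<Verb, Integer> PRIORITY = Map.of(
        Verb.MUTATION, 2, Verb.READ_REPAIR, 3, Verb.REQUEST_RESPONSE, 2,
        Verb.INTERNAL_RESPONSE, 1, Verb.COUNTER_MUTATION, 2,
        Verb.GOSSIP_DIGEST_SYN, 1, Verb.GOSSIP_DIGEST_ACK, 1,
        Verb.GOSSIP_DIGEST_ACK2, 1);

    static String byPriority(Verb verb) {
        return PRIORITY.getOrDefault(verb, 2) == 1 ? "high" : "normal";
    }

    public static void main(String[] args) {
        // Size-based: the mutation ACK shares gossip's channel.
        System.out.println(bySize(Verb.GOSSIP_DIGEST_SYN, 40));  // small
        System.out.println(bySize(Verb.REQUEST_RESPONSE, 6));    // small
        // Priority-based: the ACK stays off the gossip channel.
        System.out.println(byPriority(Verb.GOSSIP_DIGEST_SYN));  // high
        System.out.println(byPriority(Verb.REQUEST_RESPONSE));   // normal
    }
}
```

Under load, the size-based scheme queues thousands of ACKs ahead of gossip on one socket; the priority-based scheme keeps them apart.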
Additionally, once we hit this condition, we end up flapping as GOSSIP messages eventually get through, which compounds the problem.

h4. How to reproduce:

I'm unable to figure out the new stress tool, so I ran the stress tool from 2.0 against trunk (commit sha 1fab7b785dc5e440a773828ff17e927a1f3c2e5f from 4/20/15) with all defaults except for changing the replication factor from its default of 1 to 3. I'm pretty sure the reason I can't easily reproduce this with the new stress tool is that I'm failing to figure out the command-line parsing needed to change it from its default of 8 threads back to the 30-thread default of the old stress tool. While it's crazy to run with 30 threads, this simulates enough traffic on my 2014 MacBook Pro to actually back up the kernel buffers on loopback, which triggers this.

1) Set up a 3-node ccm cluster locally with all defaults (ccm create tcptest --install-dir=/Users/username/pathto/cassandra-apache/ && ccm populate -n 3 && ccm start)
2) Run stress from 2.0 using all defaults aside from specifying RF=3 (tools/bin/cassandra-stress -l 3)
3) Monitor FailureDetector messages in the logs, overall load written, etc.

h4. Expected Results:

# Without these changes, stress will not time out while inserting data. With this change, I've observed timeouts starting 50% of the way through the 1 million records.
{noformat}
Operation [303198] retried 10 times - error inserting key 0303198 ((TTransportException): java.net.SocketException: Broken pipe)
{noformat}
# Although MUTATION messages are expected to be dropped under high load, GOSSIP messages should not fail to be written to the socket in a timely manner; otherwise the FailureDetector (FD) incorrectly marks nodes DOWN.
# The amount of inserted load reported in nodetool ring should be ~250MB using the 2.0 stress tool. On my machine I saw a "final" load of 1.44MB on node(1), and only ~65MB on node(2,3). This is due to FD marking the nodes down, dropping mutations, and creating hints.
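For convenience, the reproduction steps above gathered into one command sequence. The install path is the placeholder from the steps; the log path assumes ccm's default layout and is an assumption.

```shell
# 1) Create and start a 3-node local cluster with ccm
#    (install path is a placeholder; adjust to your checkout)
ccm create tcptest --install-dir=/Users/username/pathto/cassandra-apache/
ccm populate -n 3
ccm start

# 2) Run the 2.0 stress tool with all defaults except RF=3
tools/bin/cassandra-stress -l 3

# 3) Watch for FailureDetector activity while stress runs
#    (log location assumes ccm's default ~/.ccm layout)
tail -f ~/.ccm/tcptest/node1/logs/system.log | grep -i -e 'is now DOWN' -e 'is now UP'
```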
(Additionally, once in this state, memory overhead gets even worse as we generate unnecessary hints, whereas in the prior design we were able to actually write to the socket.)

h4. Alternative Proposal

I'm 100% on board with using a more priority-based system to better utilize the two channels/sockets we have. For instance: MUTATION(2), READ_REPAIR(3), REQUEST_RESPONSE(2), REPLICATION_FINISHED(1), INTERNAL_RESPONSE(1), COUNTER_MUTATION(2), GOSSIP_DIGEST_SYN(1), GOSSIP_DIGEST_ACK(1), GOSSIP_DIGEST_ACK2(1). That way we can use the priorities to route small messages like SNAPSHOT, TRUNCATE, and GOSSIP_DIGEST_SYN over the high-priority channel and the normal-priority messages over the other channel/socket.

> OutboundTcpConnectionPool should route messages to sockets by size not type
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8789
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 3.0
>
>     Attachments: 8789.diff
>
>
> I was looking at this trying to understand what messages flow over which connection.
> For reads the request goes out over the command connection and the response comes back over the ack connection.
> For writes the request goes out over the command connection and the response comes back over the command connection.
> Reads get a dedicated socket for responses. Mutation commands and responses both travel over the same socket along with read requests.
> Sockets are used uni-directionally, so there are actually four sockets in play and four threads at each node (2 inbound, 2 outbound).
> CASSANDRA-488 doesn't leave a record of what the impact of this change was. If someone remembers what situations were made better it would be good to know.
> I am not clear on when/how this is helpful.
> The consumer side shouldn't be blocking, so the only head-of-line blocking issue is the time it takes to transfer data over the wire.
> If message size is the cause of blocking issues, then the current design mixes small messages and large messages on the same connection, retaining the head-of-line blocking.
> Read requests share the same connection as write requests (which are large), and write acknowledgments (which are small) share the same connections as write requests. The only winner is read acknowledgements.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)