[ 
https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503945#comment-14503945
 ] 

Michael Kjellman commented on CASSANDRA-8789:
---------------------------------------------

My testing has shown that relying on message size as a heuristic to determine 
the channel/socket to write to has adverse effects under load. The problem is 
that this mixes high-priority "Command" verbs (e.g. 
GOSSIP_DIGEST_SYN/GOSSIP_DIGEST_ACK) - which cannot be delayed in any way given 
the current implementation of FailureDetector - with lower-priority 
"Response/Data" verbs (e.g. MUTATION/READ/REQUEST_RESPONSE). The effect is that 
nodes flap and are incorrectly considered DOWN because Gossip verbs, now queued 
behind lower-priority messages, fail to be sent in time.

The implementation of MessagingService is "fire and forget"; however, for most 
messages we do expect some form of ACK. For instance, each MUTATION expects a 
REQUEST_RESPONSE within a given timeout, otherwise a hint is generated. Herein 
lies the problem: a REQUEST_RESPONSE verb with no payload is 6 bytes, so it is 
now considered "small" (as is INTERNAL_RESPONSE, also 6 bytes). By routing on 
size instead of priority, or instead of the old hard-coded Command/Data split 
(high-priority messages like GOSSIP over one channel, normal/low-priority 
messages over the other), the REQUEST_RESPONSE for every MUTATION is now sent 
over the same channel that used to be reserved for GOSSIP (and other 
high-priority Command) verbs.
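
To make that concrete, here is a self-contained toy sketch of size-based 
routing; the class names and the 64KB cutoff are assumptions for illustration, 
not the actual OutboundTcpConnectionPool code or the patch attached here:

{code:java}
// Toy illustration only; names and threshold are hypothetical, not the patch.
enum Verb { MUTATION, REQUEST_RESPONSE, GOSSIP_DIGEST_SYN, GOSSIP_DIGEST_ACK }

final class Message
{
    final Verb verb;
    final int serializedSize; // bytes on the wire
    Message(Verb verb, int serializedSize) { this.verb = verb; this.serializedSize = serializedSize; }
}

final class SizeBasedRouter
{
    static final int SMALL_THRESHOLD = 64 * 1024; // assumed cutoff, value illustrative

    // Pick one of the two outbound channels purely by serialized size.
    static String route(Message m)
    {
        return m.serializedSize <= SMALL_THRESHOLD ? "small-channel" : "large-channel";
    }

    public static void main(String[] args)
    {
        // A payload-free REQUEST_RESPONSE is ~6 bytes, so it shares the
        // "small" channel with the gossip digests it can now delay.
        System.out.println(route(new Message(Verb.REQUEST_RESPONSE, 6)));   // small-channel
        System.out.println(route(new Message(Verb.GOSSIP_DIGEST_SYN, 40))); // small-channel
        System.out.println(route(new Message(Verb.MUTATION, 512 * 1024)));  // large-channel
    }
}
{code}

Under the old Command/Data split (or a priority scheme like the one proposed 
below), route() would key off the verb instead, keeping those acks off the 
gossip channel.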

If the kernel buffers back up sufficiently (and although we set the NO_DELAY 
(TCP_NODELAY) option on the socket, it isn't very difficult under moderate/high 
load to still saturate the NIC), we've now moved an ACK message for every 
MUTATION onto the same socket that carries GOSSIP messages. If we back up with 
enough small messages, we will likely end up unable to send *important* 
messages (e.g. GOSSIP_DIGEST_SYN/GOSSIP_DIGEST_ACK) in time, the 
FailureDetector will be falsely triggered, and nodes will be marked DOWN. 
Additionally, once we hit this condition we end up flapping as GOSSIP messages 
eventually get through, which compounds the problem.
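
For reference, a minimal sketch of why NO_DELAY alone doesn't help here: 
TCP_NODELAY only disables Nagle's batching, it does not create buffer space, so 
once the kernel send buffer fills, a blocking write stalls and everything 
queued behind it on that socket (gossip included) waits. The host/port and the 
loop below are illustrative only, not Cassandra code:

{code:java}
import java.io.OutputStream;
import java.net.Socket;

// Illustration only: TCP_NODELAY disables Nagle's algorithm, but once
// SO_SNDBUF is full, write() blocks until the peer drains data, delaying
// every message queued behind it on this socket.
public class NoDelayIsNotBackpressure
{
    public static void main(String[] args) throws Exception
    {
        Socket socket = new Socket("127.0.0.1", 7000); // assumed local storage port
        socket.setTcpNoDelay(true);                    // same option the outbound connections use
        OutputStream out = socket.getOutputStream();
        byte[] tinyAck = new byte[6];                  // roughly a payload-free REQUEST_RESPONSE
        while (true)
            out.write(tinyAck);                        // blocks once the send buffer saturates
    }
}
{code}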

h4. How to reproduce:
I'm unable to figure out the new stress tool, so I ran the stress from 2.0 
against trunk (commit sha 1fab7b785dc5e440a773828ff17e927a1f3c2e5f from 
4/20/15) with all defaults except for changing the replication factor from its 
default of 1 to 3. I'm pretty sure the reason I can't easily reproduce this 
with the new stress is that I haven't figured out the command-line parsing to 
change it from the default of 8 threads back to the 30-thread default of the 
old stress. While it's crazy to run with 30 threads, this simulates enough 
traffic on my 2014 MacBook Pro to actually back up the kernel buffers on 
loopback, which triggers the issue.

1) Set up a 3-node ccm cluster locally with all defaults (ccm create tcptest 
--install-dir=/Users/username/pathto/cassandra-apache/ && ccm populate -n 3 && 
ccm start)
2) Run stress from 2.0 using all defaults aside from specifying RF=3 
(tools/bin/cassandra-stress -l 3)
3) Monitor FailureDetector messages in the logs, overall load written, etc.

h4. Expected Results:
# Without these changes, stress will not time out while inserting data. With 
this change, I've now observed timeouts starting 50% of the way through the 1 
million records. 
{noformat}
Operation [303198] retried 10 times - error inserting key 0303198 
((TTransportException): java.net.SocketException: Broken pipe)
{noformat}

# Although MUTATION messages are expected to be dropped under high load, 
GOSSIP messages should not fail to be written to the socket in a timely 
manner; otherwise the FailureDetector (FD) incorrectly marks nodes DOWN.
# The amount of inserted load reported in nodetool ring should be ~250MB using 
the 2.0 stress tool. On my machine I saw a "final" load of 1.44MB on node 1 and 
only ~65MB on nodes 2 and 3. This is due to FD marking the nodes down, which 
drops mutations and creates hints. (Additionally, once in this state, memory 
overhead gets even worse as we generate hints that would be unnecessary under 
the prior design, where we were able to actually write to the socket.)

h4. Alternative Proposal
I'm 100% on board with using a more priority-based system to better utilize the 
two channels/sockets we have. For instance: 
MUTATION(2),
READ_REPAIR(3),
REQUEST_RESPONSE(2),
REPLICATION_FINISHED(1),
INTERNAL_RESPONSE(1),
COUNTER_MUTATION(2),
GOSSIP_DIGEST_SYN(1),
GOSSIP_DIGEST_ACK(1),
GOSSIP_DIGEST_ACK2(1),

That way we can use the priorities to route small but critical verbs like 
SNAPSHOT, TRUNCATE, and GOSSIP_DIGEST_SYN over the high-priority channel and 
the normal-priority messages over the other channel/socket, as sketched below.
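
A rough sketch of what such priority-based routing could look like; the enum 
below just mirrors the list above, and the routing rule (priority 1 gets the 
dedicated channel) is my illustration, not a patch:

{code:java}
// Illustrative sketch: tag each verb with the priority proposed above and
// route on that instead of serialized size.
enum PrioritizedVerb
{
    MUTATION(2),
    READ_REPAIR(3),
    REQUEST_RESPONSE(2),
    REPLICATION_FINISHED(1),
    INTERNAL_RESPONSE(1),
    COUNTER_MUTATION(2),
    GOSSIP_DIGEST_SYN(1),
    GOSSIP_DIGEST_ACK(1),
    GOSSIP_DIGEST_ACK2(1);

    final int priority; // 1 = highest
    PrioritizedVerb(int priority) { this.priority = priority; }
}

final class PriorityRouter
{
    // Priority-1 verbs keep a dedicated channel so gossip (and other small,
    // critical control verbs) can never queue behind mutation traffic.
    static String route(PrioritizedVerb verb)
    {
        return verb.priority == 1 ? "high-priority-channel" : "normal-priority-channel";
    }

    public static void main(String[] args)
    {
        System.out.println(route(PrioritizedVerb.GOSSIP_DIGEST_SYN)); // high-priority-channel
        System.out.println(route(PrioritizedVerb.REQUEST_RESPONSE));  // normal-priority-channel
    }
}
{code}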

> OutboundTcpConnectionPool should route messages to sockets by size not type
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8789
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 3.0
>
>         Attachments: 8789.diff
>
>
> I was looking at this trying to understand what messages flow over which 
> connection.
> For reads the request goes out over the command connection and the response 
> comes back over the ack connection.
> For writes the request goes out over the command connection and the response 
> comes back over the command connection.
> Reads get a dedicated socket for responses. Mutation commands and responses 
> both travel over the same socket along with read requests.
> Sockets are used unidirectionally, so there are actually four sockets in play 
> and four threads at each node (2 inbound, 2 outbound).
> CASSANDRA-488 doesn't leave a record of what the impact of this change was. 
> If someone remembers what situations were made better it would be good to 
> know.
> I am not clear on when/how this is helpful. The consumer side shouldn't be 
> blocking so the only head of line blocking issue is the time it takes to 
> transfer data over the wire.
> If message size is the cause of blocking issues then the current design mixes 
> small messages and large messages on the same connection retaining the head 
> of line blocking.
> Read requests share the same connection as write requests (which are large), 
> and write acknowledgments (which are small) share the same connections as 
> write requests. The only winner is read acknowledgements.


