Oliver Seiler created CASSANDRA-6899:
----------------------------------------

             Summary: Don't include time to read a message in determining 
whether to drop message
                 Key: CASSANDRA-6899
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6899
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Oliver Seiler
            Priority: Minor


This came out of trying to understand why I was seeing a large number of 
dropped (mutation) messages on an otherwise quiet test cluster, where an 
earlier run had left a large number of queued hints from nodes in DC 1 to 
nodes in DC 2. The cluster is running Cassandra 2.0.4, with 3 nodes in DC 1, 
3 nodes in DC 2, and RF=3 for each DC. I think it's relevant to mention that 
we've enabled the inter_dc_tcp_nodelay setting.

Virtually no debug logging is done for dropped messages, so I had to dig down 
into the source to try to figure out what is going on. It appears the message 
is large enough that, combined with inter_dc_tcp_nodelay being enabled, the 
time taken to read the message from the socket exceeds the default 2-second 
write_request_timeout_in_ms setting used to determine whether to drop 
mutation messages. Note that we don't see any dropped messages in DC 1, which 
is why I believe this is related to inter_dc_tcp_nodelay; because this is a 
test cluster, the two DCs are actually on the same network (1GigE).
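
For reference, the settings involved look roughly like this; this is not a 
verbatim copy of our cassandra.yaml, just the pieces I believe are in play:

{code}
# cassandra.yaml (relevant settings only, approximate)

# Also used as the drop threshold for MUTATION messages; we are on the
# default of 2000 ms.
write_request_timeout_in_ms: 2000

# We have enabled tcp_nodelay for inter-DC traffic (the default is false),
# which means smaller, more frequent packets between the DCs.
inter_dc_tcp_nodelay: true

# Presumably still at the default, so the receiving node's own clock is
# used when deciding whether a message has timed out.
# cross_node_timeout: false
{code}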

The specific issue I'm raising here is in 
org.apache.cassandra.net.IncomingTcpConnection::receiveMessage, which obtains a 
timestamp before reading the message payload. This doesn't seem useful, since 
by the time the message would get dropped (MessageDeliveryTask::run) we've 
already read the message, queued it to the MutationStage thread pool, and have 
the MessageDeliveryTask running. It isn't clear to me 
why we'd want to include the time to read the message off the wire to determine 
whether the thread pool is backlogging, since in this case the thread pool 
*isn't* backlogging at all. In fact, once in this state, not much is going to 
allow the message to get processed (short of a configuration change), resulting 
in the message being re-sent every ten minutes; in this case a 'nodetool 
repair' was required to clear out the hints.
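
To make the sequence concrete, this is roughly the shape of the code path as I 
read it in 2.0.4 (paraphrased and simplified, not the actual source):

{code:java}
// IncomingTcpConnection.receiveMessage(), paraphrased: the timestamp is
// captured *before* the payload is read off the socket.
long timestamp = System.currentTimeMillis();             // clock starts here...
MessageIn message = MessageIn.read(input, version, id);  // ...then we block reading the payload
MessagingService.instance().receive(message, id, timestamp);

// MessageDeliveryTask.run(), paraphrased: by the time this runs the message
// has already been read and queued, but the elapsed time charged against the
// timeout still includes however long the socket read took.
if (MessagingService.DROPPABLE_VERBS.contains(message.verb)
    && System.currentTimeMillis() > constructionTime + message.getTimeout())
{
    MessagingService.instance().incrementDroppedMessages(message.verb);
    return; // mutation silently dropped; in our case it just gets re-sent from hints every ten minutes
}
{code}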

Am I missing something here? The behaviour seems intentional in 
IncomingTcpConnection, given the way the cross_node_timeout setting is used, 
and clearly we shouldn't be generating messages this large in the first place, 
but it doesn't seem useful to have logic that drops messages when there isn't 
actually any load-related reason for doing so.
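
For context, the cross_node_timeout handling I'm referring to looks roughly 
like this (again paraphrased; reconstructSenderTimestamp is just my shorthand 
for the bit arithmetic that merges in the sender's partial timestamp):

{code:java}
// IncomingTcpConnection.receiveMessage(), paraphrased. The sender's (partial)
// timestamp sits in the stream header, ahead of the payload, which I assume
// is why the local timestamp is also taken at this point.
long timestamp = System.currentTimeMillis();
int partial = input.readInt(); // always read, even if cross_node_timeout is disabled
if (DatabaseDescriptor.hasCrossNodeTimeout())
    timestamp = reconstructSenderTimestamp(timestamp, partial); // my shorthand, see above

// Either way, the payload read that follows is charged against the message
// timeout, even when the MutationStage pool is completely idle.
{code}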



