Oliver Seiler created CASSANDRA-6899:
----------------------------------------
Summary: Don't include time to read a message in determining
whether to drop message
Key: CASSANDRA-6899
URL: https://issues.apache.org/jira/browse/CASSANDRA-6899
Project: Cassandra
Issue Type: Improvement
Reporter: Oliver Seiler
Priority: Minor
This came out of trying to understand why I was seeing a large number of
dropped (mutation) messages on an otherwise quiet test cluster that had
previously been run in a way that left a large number of queued hints from
nodes in DC 1 to nodes in DC 2. The cluster is running Cassandra 2.0.4, with 3
nodes in DC 1 and 3 nodes in DC 2, and RF=3 for each DC. I think it's relevant
to mention that we've enabled the inter_dc_tcp_nodelay setting.
Virtually no debug logging is done for dropped messages, so I had to dig down
into the source to figure out what is going on. It appears the message is
large enough that, combined with our enabling of inter_dc_tcp_nodelay, the
time taken to read the message from the socket exceeds the default 2 second
write_request_timeout_in_ms setting used to determine whether to drop mutation
messages. Note that we don't see any dropped messages in DC 1, which is why I
believe this is related to inter_dc_tcp_nodelay; because this is a test
cluster, the two DCs are actually on the same 1GigE network.
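
For reference, the relevant cassandra.yaml settings look roughly like this
(the timeout is the stock default; the only non-default value here is
inter_dc_tcp_nodelay):

{noformat}
# cassandra.yaml (excerpt)
write_request_timeout_in_ms: 2000   # default; the drop threshold applied to mutation messages
inter_dc_tcp_nodelay: true          # we enabled this; the default is false
{noformat}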
The specific issue I'm raising here is in
org.apache.cassandra.net.IncomingTcpConnection::receiveMessage, which obtains
a timestamp before reading the message payload. This doesn't seem useful,
since by the point the message would get dropped (MessageDeliveryTask::run)
we've already read the full message off the wire, queued it to the
MutationStage thread pool via MessageDeliveryTask, and have that task running.
It isn't clear to me why we'd want to include the time spent reading the
message off the wire when deciding whether the thread pool is backlogged,
since in this case the thread pool *isn't* backlogged at all. In fact, once in
this state, not much is going to allow the message to get processed (short of
a configuration change), so the message is re-sent every ten minutes; in this
case a 'nodetool repair' was required to clear out the hints.
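
Roughly, the flow as I read it in 2.0.4 (paraphrased and trimmed, so the
snippet below is illustrative rather than a verbatim copy of the source):

{code:java}
// IncomingTcpConnection.receiveMessage(), paraphrased: the timestamp is taken
// *before* the payload bytes have been read off the socket.
long timestamp = System.currentTimeMillis();                  // replaced by the sender's timestamp if cross_node_timeout is enabled
MessageIn<?> message = MessageIn.read(input, version, id);    // a large payload makes this read slow
MessagingService.instance().receive(message, id, timestamp);  // wraps it in a MessageDeliveryTask and enqueues it

// MessageDeliveryTask.run(), paraphrased: the slow socket read has already
// been charged against the message, so it is dropped here even though the
// MutationStage pool is idle.
if (System.currentTimeMillis() > constructionTime + message.getTimeout())  // 2000 ms for mutations
{
    MessagingService.instance().incrementDroppedMessages(verb);
    return;
}
{code}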
Am I missing something here? The behaviour seems intentional in
IncomingTcpConnection, given the way the cross_node_timeout setting is used,
and clearly we shouldn't be generating large messages like this in the first
place, but it doesn't seem useful to have logic that drops messages when there
isn't actually any load-related reason for doing so.
--
This message was sent by Atlassian JIRA
(v6.2#6252)