[ https://issues.apache.org/jira/browse/CASSANDRA-6899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-6899.
---------------------------------------

    Resolution: Not A Problem

The point is that the coordinator will have given up waiting for a reply and 
returned a timeout to the client.  If you want to throw really large batches or 
blobs around (a bad idea in the first place), you will need to increase the 
timeout.
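
For reference, the settings involved all live in cassandra.yaml. A minimal 
excerpt might look like the following; the values are illustrative, not 
recommendations:

    # Coordinator-side wait for write acks; also the age beyond which a queued
    # MUTATION is considered droppable on the replica (default 2000).
    write_request_timeout_in_ms: 10000

    # When true, replicas use the sender's timestamp (requires synchronized
    # clocks) rather than their local receive time when deciding whether a
    # message has already timed out. Default is false.
    cross_node_timeout: false

    # When true, disables Nagle's algorithm for inter-DC traffic: smaller and
    # more frequent packets at the cost of extra TCP overhead. Default is false.
    inter_dc_tcp_nodelay: true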

> Don't include time to read a message in determining whether to drop message
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6899
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6899
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Oliver Seiler
>            Priority: Minor
>
> This came out of trying to understand why I was seeing a large number of 
> dropped (mutation) messages on an otherwise quiet test cluster that had 
> previously built up a large number of queued hints from nodes in DC 1 to 
> nodes in DC 2. The cluster is Cassandra 2.0.4, with 3 nodes in DC 1, 3 nodes 
> in DC 2, and RF=3 for each DC. I think it's relevant to mention that we've 
> enabled the inter_dc_tcp_nodelay setting.
>
> Virtually no debug logging is done for dropped messages, so I had to dig 
> down into the source to try to figure out what is going on. It appears the 
> message is large enough that, combined with our enabling of 
> inter_dc_tcp_nodelay, the time taken to read the message from the socket 
> exceeds the default 2 second write_request_timeout_in_ms setting used to 
> determine whether to drop mutation messages. Note that we don't see any 
> dropped messages in DC 1, which is why I believe this is related to 
> inter_dc_tcp_nodelay; because this is a test cluster, the two DCs are 
> actually on the same network (1GigE).
>
> The specific issue I'm raising here is in 
> org.apache.cassandra.net.IncomingTcpConnection::receiveMessage, which obtains 
> a timestamp before reading the message payload. This doesn't seem useful, 
> since by the time the message would get dropped (MessageDeliveryTask::run) 
> we've already read the message and queued it to the MutationStage thread 
> pool via a MessageDeliveryTask, which is now running. It isn't clear to me 
> why we'd want to include the time taken to read the message off the wire 
> when deciding whether the thread pool is backlogged, since in this case the 
> thread pool *isn't* backlogged at all. In fact, once in this state, not much 
> is going to allow the message to get processed (short of a configuration 
> change), so the message is re-sent every ten minutes; in this case a 
> 'nodetool repair' was required to clear out the hints.
>
> Am I missing something here? The behaviour seems intentional in 
> IncomingTcpConnection, given the way the cross_node_timeout setting is used, 
> and clearly we shouldn't be generating large messages like this, but it 
> doesn't seem useful to have logic that results in messages being dropped 
> when there isn't actually any load-related reason for doing so.
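
To make the sequence described above concrete, here is a minimal, 
self-contained Java sketch of the pattern (not the actual Cassandra code), 
assuming the 2.0-era flow the reporter describes: IncomingTcpConnection takes 
the timestamp before reading the payload, and MessageDeliveryTask drops the 
mutation if it is older than write_request_timeout_in_ms. A slow socket read 
by itself is enough to push the mutation past the timeout, even though the 
mutation stage is idle the whole time:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class DroppedMutationSketch
    {
        // Stand-in for write_request_timeout_in_ms (default 2000).
        static final long WRITE_TIMEOUT_MS = 2000;

        // Stand-in for the MutationStage thread pool.
        static final ExecutorService MUTATION_STAGE = Executors.newSingleThreadExecutor();

        // Roughly the shape of IncomingTcpConnection::receiveMessage as described
        // above: the timestamp is captured before the (possibly slow) payload
        // read, so the read time is charged against the message's timeout.
        static void receiveMessage(DataInputStream in, int payloadSize) throws IOException
        {
            long constructionTime = System.currentTimeMillis();   // taken BEFORE the read
            byte[] payload = new byte[payloadSize];
            in.readFully(payload);                                 // slow read eats the timeout
            MUTATION_STAGE.execute(() -> deliver(payload, constructionTime));
        }

        // Roughly the drop check MessageDeliveryTask::run applies to droppable
        // verbs such as MUTATION.
        static void deliver(byte[] payload, long constructionTime)
        {
            long age = System.currentTimeMillis() - constructionTime;
            if (age > WRITE_TIMEOUT_MS)
            {
                System.out.println("dropped mutation, age " + age + " ms (the stage was idle the whole time)");
                return;
            }
            System.out.println("applied mutation of " + payload.length + " bytes, age " + age + " ms");
        }

        public static void main(String[] args) throws Exception
        {
            byte[] blob = new byte[8];
            // Simulate a socket on which the payload takes longer than the write
            // timeout to arrive (e.g. a very large blob on a slow or chatty link).
            DataInputStream slow = new DataInputStream(new ByteArrayInputStream(blob)
            {
                @Override
                public synchronized int read(byte[] b, int off, int len)
                {
                    try { Thread.sleep(2500); } catch (InterruptedException ignored) { }
                    return super.read(b, off, len);
                }
            });
            receiveMessage(slow, blob.length);
            MUTATION_STAGE.shutdown();
            MUTATION_STAGE.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

Run as-is, the simulated 2.5 second read makes the "dropped mutation" branch 
fire against the default 2 second timeout, which is the behaviour the ticket 
is questioning.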



--
This message was sent by Atlassian JIRA
(v6.2#6252)
