[ https://issues.apache.org/jira/browse/CASSANDRA-9558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577606#comment-14577606 ]

Benedict commented on CASSANDRA-9558:
-------------------------------------

bq. Evidently this is exactly what writeAndFlush does, which is what the driver 
is using when coalescing is disabled, but I'll keep exploring alternatives

But it's not a choice between the two. There should absolutely be coalescing, 
and it should never be disabled. The question is whether we should artificially 
delay our messages in order to coalesce more of them. On a client I cannot see 
it making sense to do so. On the server, we expect the server to have other 
useful work to do, i.e. to produce more responses that can be coalesced 
together. On a client, however, we should not make that assumption: if the 
client is synchronously waiting for a result, we're pointlessly delaying it 
(and we cannot know whether this is the case), whereas if it is asynchronously 
producing work, that work will accumulate or not completely independently of 
our delay, and after the first, potentially more costly, message the costs will 
reach a steady state on which the delay is unlikely to have any positive 
effect.
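
To make the distinction concrete, here is a minimal Netty sketch of coalescing 
with no artificial delay (this is not the driver's actual code, and process() 
is a hypothetical request handler): every message is written into the 
channel's outbound buffer, and a single flush is issued once the current burst 
of reads has been handled.

{code:java}
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Natural coalescing, no artificial delay: write() only buffers; the one
// flush() in channelReadComplete() pushes everything produced during this
// event-loop pass to the socket in a single syscall.
public class NaturalCoalescingHandler extends ChannelInboundHandlerAdapter
{
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg)
    {
        ctx.write(process(msg)); // buffer the response; no syscall yet
    }

    @Override
    public void channelReadComplete(ChannelHandlerContext ctx)
    {
        ctx.flush(); // one flush for the whole read burst
    }

    private Object process(Object msg)
    {
        return msg; // placeholder; real code would build a response
    }
}
{code}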

The main idea of it on the server is that the delay permits the server to 
exhaust its current burst of messages (if possible), so that all messages that 
would naturally be grouped, given the chance, can be.
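
A rough sketch of that server-side idea, assuming a Netty outbound handler 
that holds back the flush for a small window (the 200µs figure is purely 
illustrative, not Cassandra's actual setting):

{code:java}
import java.util.concurrent.TimeUnit;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelOutboundHandlerAdapter;
import io.netty.channel.ChannelPromise;

// Delay-based coalescing: buffer writes and flush once per window, so the
// rest of the current burst of responses can join the same packet. Netty
// invokes outbound handlers on the channel's event loop, so the flag below
// needs no synchronization.
public class DelayedFlushHandler extends ChannelOutboundHandlerAdapter
{
    private static final long WINDOW_MICROS = 200; // illustrative window
    private boolean flushScheduled;

    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise)
    {
        ctx.write(msg, promise); // buffer only
        if (!flushScheduled)
        {
            flushScheduled = true;
            ctx.channel().eventLoop().schedule(() -> {
                flushScheduled = false;
                ctx.flush(); // flushes everything accumulated in the window
            }, WINDOW_MICROS, TimeUnit.MICROSECONDS);
        }
    }

    @Override
    public void flush(ChannelHandlerContext ctx)
    {
        // Suppress immediate flushes; the scheduled task flushes instead.
        // (ctx.flush() above starts at the next handler, so it gets through.)
    }
}
{code}

The trade-off is exactly the one described above: this wins when more 
responses genuinely arrive inside the window, and only adds latency when they 
don't.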

That all said, some basic back-of-envelope maths suggests this cannot 
sufficiently account for the problem in this case. That doesn't mean we 
shouldn't change it, but it is unlikely to explain this ticket.

We should really try to profile the client and server, to establish which is 
the bottleneck, and where. It should not be the case that we need multiple 
threads to deal with this workload: we're effectively batching up to 300 of 
these messages together over a single point-to-point high-bandwidth TCP 
connection, so the per-message costs are maximally amortized. The fact that 
this cannot cope with more than 7MB/s is crazy. It is possible we're hitting 
another weird issue with interrupt queues in AWS.
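
For reference, the back-of-envelope version of that claim, with the average 
message size being an assumed figure rather than one taken from the ticket:

{code:java}
// Back-of-envelope arithmetic: at 7MB/s with ~300 messages per flush, the
// syscall rate is tiny, so per-flush overhead cannot be the bottleneck.
// The 100-byte average message size is an assumption for illustration.
public class BackOfEnvelope
{
    public static void main(String[] args)
    {
        double bytesPerSecond   = 7_000_000; // observed ceiling from the ticket
        int messagesPerFlush    = 300;       // batch size quoted above
        int assumedMessageBytes = 100;       // hypothetical average size

        double bytesPerFlush  = messagesPerFlush * (double) assumedMessageBytes; // ~30KB
        double flushesPerSec  = bytesPerSecond / bytesPerFlush;                  // ~233/s
        double msBetweenCalls = 1000.0 / flushesPerSec;                          // ~4.3ms

        System.out.printf("bytes per flush:      %.0f%n", bytesPerFlush);
        System.out.printf("flushes per second:   %.0f%n", flushesPerSec);
        System.out.printf("gap between syscalls: %.1f ms%n", msBetweenCalls);
        // ~233 write() calls per second, one every ~4ms: trivially cheap,
        // so the 7MB/s ceiling has to come from somewhere else.
    }
}
{code}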

> Cassandra-stress regression in 2.2
> ----------------------------------
>
>                 Key: CASSANDRA-9558
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9558
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Alan Boudreault
>            Priority: Blocker
>         Attachments: 2.1.log, 2.2.log, CASSANDRA-9558-2.patch, 
> CASSANDRA-9558-ProtocolV2.patch, atolber-CASSANDRA-9558-stress.tgz, 
> atolber-trunk-driver-coalescing-disabled.txt, 
> stress-2.1-java-driver-2.0.9.2.log, stress-2.1-java-driver-2.2+PATCH.log, 
> stress-2.1-java-driver-2.2.log, stress-2.2-java-driver-2.2+PATCH.log, 
> stress-2.2-java-driver-2.2.log
>
>
> We are seeing some regression in performance when using cassandra-stress 2.2. 
> You can see the difference at this url:
> http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=stress_regression.json&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=108.57&ymin=0&ymax=168147.1
> The Cassandra version of the cluster doesn't seem to have any impact. 
> //cc [~tjake] [~benedict]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
