[ https://issues.apache.org/jira/browse/CASSANDRA-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660739#comment-13660739 ]

Sylvain Lebresne commented on CASSANDRA-5422:
---------------------------------------------

Thanks a lot Daniel for taking the time to look into that.

I was curious to understand where the main benefits were coming from, so I 
tried benchmarking the optimizations separately.

First, the baseline on my machine (a quad-core i5 2.80Ghz) with the current 
java driver and C* 1.2 (with ModificationStatement execute commented out) is 
about 66K req/s (I'll note that I've already committed a patch to remove the 
contention on returnConnection in the Java driver and it's included in that 
baseline). That's with 50 threads (and it's slightly worse with 500 threads).

Long story short, the first bottleneck is the Java driver "stress" application, 
which I can't say is a surprise since it was a fairly quick hack primarily 
meant to check that the driver wasn't crashing with more than one thread. 
[~danielnorberg], I'm happy to commit your patch optimizing this, though the 
patch removes the Apache license from one file and adds some copyright, so I'm 
wondering whether the patches were meant for inclusion or not?

Anyway, even with the stress patch committed, I don't get much improvement yet. 
More precisely, by default (synchronous mode, 50 threads) I get 74K, which is 
slightly better but not amazing. If I try the async mode with 500 threads (to 
compare with what's coming next), I actually get about 49K.

At this point, the main bottleneck by far seems to be the ArrayBlockingQueue 
used in the RequestThreadPoolExecutor. Changing it to LinkedBlockingQueue, we 
get 163K with stress in async mode and 500 threads (which is then the fastest 
mode: in synchronous mode, I get 95K with 50 threads and 117K with 500 threads).
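
For illustration, the swap amounts to something like the following (a minimal 
sketch, not the actual RequestThreadPoolExecutor change; the class and method 
names are made up):

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueSwapSketch
{
    public static ThreadPoolExecutor makeExecutor(int threads)
    {
        // ArrayBlockingQueue guards offer() and poll() with a single lock,
        // so every connection thread queuing a request contends with the
        // workers draining the queue. LinkedBlockingQueue uses separate
        // put/take locks (and is unbounded), which removes most of that
        // contention.
        return new ThreadPoolExecutor(threads, threads,
                                      60, TimeUnit.SECONDS,
                                      new LinkedBlockingQueue<Runnable>());
    }
}
{code}

The swap being a one-line constructor-argument change is what makes the patch 
so trivial to commit.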

So, I've committed that part (to 1.2) since it's such a trivial patch and is 
clearly the main bottleneck, at least on the Cassandra side. On trunk, I'll 
note that if we go ahead with CASSANDRA-5239, it'll remove 
RequestThreadPoolExecutor altogether, which could improve things even more 
(though it's possible that once we've switched to LinkedBlockingQueue, it's 
not much of a bottleneck anymore).

bq. Expensive serialization, i.e. multiple layers of ChannelBuffers used in the 
ExecuteMessage codec.

The vague rationale here was to avoid a copy of the values (when they are not 
trivially small). I did try to quickly bench that patch separately (on top of 
the other optimizations) and didn't really see a difference. Though I didn't 
see much difference when increasing the value size either, tbh (it could be 
that there is some other bottleneck, like the generation of bigger values for 
instance; I haven't checked). In any case, before changing the serialization 
of all messages, it's probably worth some more thorough investigation. But I'm 
not sure we have a ton to win here, if any.
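
To make the trade-off concrete, the two strategies look roughly like this (a 
hypothetical sketch against the Netty 3 API, not the actual ExecuteMessage 
codec):

{code:java}
import java.nio.ByteBuffer;
import org.jboss.netty.buffer.ChannelBuffer;
import org.jboss.netty.buffer.ChannelBuffers;

public class ValueSerializationSketch
{
    // Flat approach: allocate one buffer and copy the value bytes into it.
    static ChannelBuffer flat(ChannelBuffer header, ByteBuffer value)
    {
        ChannelBuffer out = ChannelBuffers.buffer(header.readableBytes() + value.remaining());
        out.writeBytes(header);
        out.writeBytes(value.duplicate());
        return out;
    }

    // Wrapped approach: compose header and value without copying the value
    // bytes. The extra ChannelBuffer layer is the cost; skipping the copy
    // of (potentially large) values is the hoped-for benefit.
    static ChannelBuffer wrapped(ChannelBuffer header, ByteBuffer value)
    {
        return ChannelBuffers.wrappedBuffer(header, ChannelBuffers.wrappedBuffer(value));
    }
}
{code}

With tiny values (the test in this issue uses 4-byte columns), the copy in the 
flat version is nearly free, which would be consistent with not seeing a 
difference.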

bq. No write batching

I agree that write batching is a good idea. That being said, write batching is 
often a trade-off between throughput and latency, so ideally I'd like to expose 
some of the tweaking knobs and/or test it on more realistic and varied 
scenarios.

That said, testing it (both client and server side) on top of the ABQ->LBQ 
patch, I get 180K req/s (versus 163K), so about a 10% improvement on that 
test, which ain't bad.
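
Concretely, the knobs I mean are the batch size and flush delay in something 
like this (a hypothetical sketch of the general technique, not Daniel's 
patch):

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchingWriter implements Runnable
{
    private final BlockingQueue<byte[]> pending = new LinkedBlockingQueue<byte[]>();
    private final OutputStream out;
    private final int maxBatch = 64;         // knob: max frames per flush
    private final long maxDelayMicros = 200; // knob: how long to wait for more frames

    public BatchingWriter(OutputStream out)
    {
        this.out = out;
    }

    public void enqueue(byte[] frame)
    {
        pending.add(frame);
    }

    public void run()
    {
        List<byte[]> batch = new ArrayList<byte[]>(maxBatch);
        try
        {
            while (!Thread.currentThread().isInterrupted())
            {
                // Block for the first frame, then give producers a short
                // window to queue more: that window is the latency we trade
                // for fewer, larger writes.
                batch.add(pending.take());
                byte[] next;
                while (batch.size() < maxBatch
                       && (next = pending.poll(maxDelayMicros, TimeUnit.MICROSECONDS)) != null)
                    batch.add(next);
                for (byte[] frame : batch)
                    out.write(frame);
                out.flush(); // ideally one syscall for the whole batch
                batch.clear();
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }
}
{code}

maxBatch and maxDelayMicros are exactly the kind of knobs I'd want exposed so 
the trade-off can be tuned per workload.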

As a side note, same question as above on the license/copyright for the 
batching parts of the patch.

                
> Binary protocol sanity check
> ----------------------------
>
>                 Key: CASSANDRA-5422
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5422
>             Project: Cassandra
>          Issue Type: Bug
>          Components: API
>            Reporter: Jonathan Ellis
>            Assignee: Daniel Norberg
>         Attachments: 5422-test.txt
>
>
> With MutationStatement.execute turned into a no-op, I only get about 33k 
> insert_prepared ops/s on my laptop.  That is: this is an upper bound for our 
> performance if Cassandra were infinitely fast, limited by netty handling the 
> protocol + connections.
> This is up from about 13k/s with MS.execute running normally.
> ~40% overhead from netty seems awfully high to me, especially for 
> insert_prepared where the return value is tiny.  (I also used 4-byte column 
> values to minimize that part as well.)
