[
https://issues.apache.org/jira/browse/CASSANDRA-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504931#comment-14504931
]
Benedict commented on CASSANDRA-7029:
-------------------------------------
QUIC is exactly the kind of thing I'm talking about, although it may be more
complex than we need (and we may not want to wait for a Java implementation to
appear for general consumption).
bq. Do we even have an idea of how much overhead the transport is for us? If I
were to guess I'd say probably around 10% which is in "not worth optimizing
yet" territory for me.
I've yet to obtain datapoints on a server, but when benchmarking CASSANDRA-4718
locally I found networking calls to be a really significant portion of CPU
time (>30%) for the small in-memory workloads we were testing. This would
likely be lower on server-grade hardware, since some of the work would be
offloaded to the NIC, but we know we need many netty threads to reach peak
performance - most likely for interrupt queues, though it also seems the cost
of the kernel calls to manage connection state overloads a single thread. If
we can drive the entire networking of the box from a single thread (which
should be possible over a UDP protocol), I am optimistic we could see really
significant dividends.
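As a rough sketch of what "the entire networking of the box from a single thread" could look like - the class name, port, and handler below are hypothetical, not anything in Cassandra - one non-blocking DatagramChannel serviced by one thread, so all peers share a socket and no cross-thread hand-off is needed:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Hypothetical sketch: one non-blocking UDP socket, one thread, no signalling.
final class SingleThreadUdpDriver implements Runnable {
    private final DatagramChannel channel;
    private final ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
    private volatile long received;   // messages handled so far (single writer)

    SingleThreadUdpDriver(int port) throws IOException {
        channel = DatagramChannel.open();
        channel.configureBlocking(false);            // never park in the kernel
        channel.bind(new InetSocketAddress(port));
    }

    long received() { return received; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            buf.clear();
            SocketAddress from;
            try {
                from = channel.receive(buf);         // null when no datagram is ready
            } catch (IOException e) {
                break;                               // channel closed (e.g. on interrupt)
            }
            if (from == null) continue;              // busy-spin rather than signal another thread
            buf.flip();
            handle(from, buf);                       // process inline, on this same thread
        }
    }

    private void handle(SocketAddress from, ByteBuffer msg) {
        received++;                                  // real dispatch would go here
    }
}
```

All state is confined to the one thread, which is exactly what makes the signalling costs discussed below disappear.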
The problem with that is that the highest-yield improvements for benchmarking
are likely to come from replacing the client connections, since we generally
test small clusters, and refactoring those to support multiple protocols is
trickier than it is for MessagingService. It is probably worth doing, though,
so that we can easily explore all of our ideas for efficient transport.
> Investigate alternative transport protocols for both client and inter-server
> communications
> -------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-7029
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7029
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Labels: performance
> Fix For: 3.0
>
>
> There are a number of reasons to think we can do better than TCP for our
> communications:
> 1) We can actually tolerate sporadic small message losses, so guaranteed
> delivery isn't essential (although for larger messages it probably is)
> 2) As shown in \[1\] and \[2\], Linux can behave quite suboptimally with
> regard to TCP message delivery when the system is under load. Judging from
> the theoretical description, this is likely to apply even when system load
> is not high but the number of processes to schedule is. Cassandra generally
> has a lot of threads to schedule, so this is quite pertinent for us. UDP
> performs substantially better here.
> 3) Even when the system is not under load, UDP has a lower CPU burden, and
> that burden is constant regardless of the number of connections it processes.
> 4) In a simple benchmark on my local PC, using non-blocking IO for UDP and
> busy-spinning on IO, I can actually push 20-40% more throughput over
> loopback (where TCP should be optimal, since there is no latency), even for
> very small messages. Since we can see networking taking multiple CPUs' worth
> of time during a stress test, busy-spinning for ~100micros after the last
> message receipt is almost certainly acceptable, especially as we can
> (ultimately) process inter-server and client communications on the same
> thread/socket in this model.
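The spin-then-back-off policy described in point 4 could be sketched as follows (the class name, 100-micro budget, and back-off interval are illustrative assumptions, not measured values):

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.Supplier;

// Hypothetical sketch: after the last message, busy-spin for ~100 micros
// before falling back to brief parks, trading a little CPU for much lower
// wake-up latency than a blocking read.
final class SpinThenPark {
    private static final long SPIN_NANOS = 100_000L;   // ~100 micros of spinning

    /** Polls `poll` until it yields a value, spinning briefly before parking. */
    static <T> T awaitNext(Supplier<T> poll) {
        final long spinDeadline = System.nanoTime() + SPIN_NANOS;
        while (true) {
            T msg = poll.get();                        // non-blocking poll
            if (msg != null) return msg;               // spin budget resets per message
            if (System.nanoTime() < spinDeadline) {
                Thread.onSpinWait();                   // Java 9+ spin hint; stay hot on the CPU
            } else {
                LockSupport.parkNanos(1_000L);         // budget spent: back off cheaply
            }
        }
    }
}
```

A real receive loop would pass the channel's non-blocking `receive` as the poll function.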
> 5) We can optimise the threading model heavily: since we generally process
> very small messages (200 bytes is not at all implausible), thread-signalling
> costs can dramatically impede throughput. It generally costs ~10micros to
> signal (and passing a message to another thread for processing, as the
> current model requires, means signalling), so for 200-byte messages this
> caps our throughput at 20MB/s.
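The 20MB/s figure in point 5 follows directly from the arithmetic - at ~10 micros per signal a signalling thread can hand off at most 100,000 messages a second, and 100,000 x 200 bytes is 20MB:

```java
// Back-of-envelope check of the throughput ceiling quoted above.
final class SignallingCeiling {
    /** Ceiling in MB/s when every fixed-size message costs one thread-signal. */
    static long ceilingMBps(long messageBytes, long signalMicros) {
        long messagesPerSecond = 1_000_000L / signalMicros;    // signals available per second
        return messagesPerSecond * messageBytes / 1_000_000L;  // bytes/s -> MB/s
    }

    public static void main(String[] args) {
        // 200-byte messages at ~10micros per signal
        System.out.println(ceilingMBps(200, 10) + "MB/s");     // prints "20MB/s"
    }
}
```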
> I propose to knock up a highly naive UDP-based connection protocol with
> super-trivial congestion control over the course of a few days, with the
> only initial goal being maximum possible performance (not fairness,
> reliability, or anything else), and trial it in Netty (possibly making some
> changes to Netty to mitigate thread-signalling costs). The reason for
> knocking up our own here is to get a ceiling on the absolute limit of
> potential for this approach. Assuming this pans out with performance gains
> in C* proper, we then look at contributing to/forking the udt-java project
> and see how easy it is to bring its performance in line with our naive
> approach (I don't suggest starting there, as that project uses blocking
> old-IO, modifying it with latency in mind may be challenging, and we
> wouldn't know for sure what the best-case scenario is).
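"Super-trivial congestion control" could be as little as a cap on unacknowledged datagrams in flight; the sketch below (class name and initial window of 16 are invented for illustration, not from any Cassandra or udt-java code) grows the cap by one per ack and halves it on suspected loss:

```java
// Hypothetical minimal congestion control: a bounded in-flight window with
// additive increase on ack and multiplicative decrease on loss.
final class NaiveWindow {
    private int window = 16;   // max unacked datagrams allowed (illustrative)
    private int inFlight = 0;  // datagrams sent but not yet acked

    boolean canSend() { return inFlight < window; }
    void onSend()     { inFlight++; }
    void onAck()      { inFlight--; window++; }                       // additive increase
    void onLoss()     { inFlight--; window = Math.max(1, window / 2); } // multiplicative decrease
}
```

The sender simply checks `canSend()` before each datagram; everything else (retransmit timers, ack format) would stay equally naive in a first cut.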
> \[1\]
> http://test-docdb.fnal.gov/0016/001648/002/Potential%20Performance%20Bottleneck%20in%20Linux%20TCP.PDF
> \[2\]
> http://cd-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=1968;filename=Performance%20Analysis%20of%20Linux%20Networking%20-%20Packet%20Receiving%20(Official).pdf;version=2
> Further related reading:
> http://public.dhe.ibm.com/software/commerce/doc/mft/cdunix/41/UDTWhitepaper.pdf
> https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/14482/ChoiUndPerTcp.pdf?sequence=1
> https://access.redhat.com/site/documentation/en-US/JBoss_Enterprise_Web_Platform/5/html/Administration_And_Configuration_Guide/jgroups-perf-udpbuffer.html
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153.3762&rep=rep1&type=pdf
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)