[Gluster-devel] Jeff Darcy's objections to multi-thread-epoll and proposal to use own-thread alternative

Ben England Tue, 14 Oct 2014 07:24:58 -0700

This e-mail is specifically about use of the multi-thread-epoll optimization 
(originally prototyped by Anand Avati) to solve a Gluster performance problem: 
single-threaded reception of protocol messages (for non-SSL sockets), and the 
consequent inability to fully utilize available CPU on the server.  A 
discussion of its pros and cons follows, along with the alternative to it 
suggested by Jeff Darcy, referred to as "own-thread" below.  Thanks to Shyam 
Ranganathan for helping me to clarify my thoughts on this.  Attached is some 
performance data about multi-thread-epoll.


To see why this threading discussion matters, consider that storage hardware 
encountered in the enterprise server world is rapidly speeding up with new 
hardware such as 40-Gbps networks and SSDs, but CPUs are not speeding up nearly 
as much.  Instead, we have more cores per socket.  So adequate performance for 
Gluster will require use of sufficient threads to match CPU throughput to 
network and storage.  

One way to get the server's idle CPU horsepower engaged is JBOD (just a bunch 
of disks, no RAID), since there is one glusterfsd, and hence one epoll thread, 
per brick (disk).  However, JBOD causes scalability problems for small-file 
creates (cluster.lookup-unhashed=on is the default), and it limits throughput 
of an individual file to the speed of a single disk drive, so until these 
problems are addressed, the utility of the JBOD approach is limited.

----- Original Message -----
> From: "Jeff Darcy" <[email protected]>
> To: "Gluster Devel" <[email protected]>
> Sent: Wednesday, October 8, 2014 4:20:34 PM
> Subject: [Gluster-devel] jdarcy status (October 2014)
> 
> Multi-threading is even more controversial.  It has also been in the
> tree for two years (it was developed to address the problem of SSL code
> slowing down our entire transport stack).  This feature, controlled by
> the "own-thread" transport option, uses a thread per connection - not my
> favorite concurrency model, but kind of necessary to deal with the
> OpenSSL API.  More recently, a *completely separate* approach to
> multi-threading - "multi-threaded epoll" - has been getting some
> attention.  Here's what I see as the pros and cons of this new approach.
> 
>  * PRO: greater parallelism of requests on a single connection.  I think
>    the actual performance benefits vs. own-thread are unproven and
>    likely to be small, but they're real.
>

We should try comparing the performance of multi-thread-epoll to own-thread; 
it shouldn't be hard to hack own-thread into the non-SSL-socket case.  

HOWEVER, if "own-thread" implies a thread per network connection, then as you 
scale out a Gluster volume with N bricks, you have O(N) clients, and therefore 
you have O(N) threads on each glusterfsd (libgfapi adoption would make it far 
worse)!  Suppose we are implementing a 64-brick configuration with 200 clients, 
not an unreasonably sized Gluster volume for a scalable filesystem.   We then 
have 200 threads per glusterfsd just listening for RPC messages on each brick.  
On a 60-drive server there can be a lot more than 1 brick per server, so 
multiply threads/glusterfsd by the number of bricks on that server!  It doesn't 
make sense to have total threads >= CPUs, and modern processors make context 
switching between threads more and more expensive.  
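
To put rough numbers on the paragraph above, here is a minimal sketch of the 
arithmetic.  It assumes every client holds one connection to every brick, and 
it assumes (my number, purely for illustration) one brick per drive on the 
60-drive server; the function name is made up.

```python
def listener_threads(clients, bricks_per_server):
    """Thread-per-connection listener counts as (per glusterfsd, per server).

    Assumption: each client holds exactly one connection to each brick.
    """
    per_glusterfsd = clients                       # one thread per connection
    per_server = per_glusterfsd * bricks_per_server
    return per_glusterfsd, per_server


# 200 clients from the example above; 60 bricks per server is an assumption
# (one brick per drive on a 60-drive JBOD server).
per_glusterfsd, per_server = listener_threads(clients=200, bricks_per_server=60)
print(per_glusterfsd, per_server)   # prints: 200 12000
```

Under those assumptions a single server carries 12000 listener threads before 
any io-threads workers are counted.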

Shyam mentioned a refinement to own-thread where we equally partition the set 
of TCP connections among a pool of threads (own-thread is a special case of 
this).  This cannot supply an individual client with more than 1 thread to 
receive RPCs, even when most of the CPU cores on the server are idle.  Why 
impose this constraint (see below)?  To see why this is important, consider a 
common use case: KVM virtualization.  

SSDs require orders of magnitude more IOPS from glusterfsd and glusterfs than a 
traditional rotating disk.  So even if you dedicate a thread to a single 
network connection, this thread may still have trouble keeping up with the 
high-speed network and the SSD.  Multi-thread-epoll is the only proposal so far 
that offers a way to apply enough CPU to this problem.  Consider that some SSDs 
have throughput on the order of a million IOPS (I/O operations per second).  In 
the past, we have worked around this problem by placing multiple bricks on a 
single SSD, but this causes other problems (scalability, free space 
measurement).
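
A small sketch of the two points above.  The modulo mapping is only one 
hypothetical way to partition connections among a pool, and the budget 
calculation assumes one RPC per I/O operation; neither is taken from the 
Gluster code.

```python
def owner_thread(conn_id, pool_size):
    """Static partitioning: a connection is pinned to exactly one pool thread."""
    return conn_id % pool_size


def per_rpc_budget_us(iops, receiving_threads):
    """Average microseconds each receiving thread may spend per RPC.

    Assumption: one RPC per I/O operation, spread evenly over the threads.
    """
    return receiving_threads * 1_000_000 / iops


# One busy connection (say, a KVM host) is served by one thread no matter
# how large the pool is.
for pool_size in (1, 16, 64):
    print(pool_size, owner_thread(conn_id=7, pool_size=pool_size))

# At a million IOPS, a single receiving thread has about 1 microsecond per RPC;
# eight threads sharing the same connection would have about 8 each.
print(per_rpc_budget_us(1_000_000, 1), per_rpc_budget_us(1_000_000, 8))
```

With a partitioned pool only the first budget is available to a single 
connection; the second requires that several threads can receive from it, which 
is what multi-thread-epoll allows.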


>  * CON: with greater concurrency comes greater potential to uncover race
>    conditions in other modules used to being single-threaded.  We've
>    already seen this somewhat with own-thread, and we'd see it more with
>    multi-epoll.
> 

On the Gluster server side, because of the io-threads translator, an RPC 
listener thread effectively just hands each request to a worker thread and then 
goes back to read another RPC.  With own-thread, although RPC requests are 
received in order, there is no guarantee that the requests will be processed in 
the order that they were received from the network.   On the client side, we 
have operations such as readdir that will fan out parallel FOPs.  If you use 
the own-thread approach, then these parallel FOP replies can all be processed 
in parallel by the listener threads, so you get at least the same level of 
exposure to race conditions that you would get with multi-thread-epoll.
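
The ordering point can be illustrated with a toy model (not Gluster code): a 
single listener reads requests in order and hands them to a worker pool that 
stands in for io-threads.  The request IDs and sleep times are invented for the 
illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

received = []            # order in which the single listener read requests
completed = []           # order in which the workers finished them
lock = threading.Lock()


def process(req_id, work_seconds):
    time.sleep(work_seconds)             # stand-in for the real FOP work
    with lock:
        completed.append(req_id)


# Hypothetical workload: request 0 is slow, requests 1 and 2 are fast.
requests = [(0, 0.2), (1, 0.0), (2, 0.0)]

with ThreadPoolExecutor(max_workers=3) as workers:   # plays the io-threads role
    for req_id, work_seconds in requests:            # the single listener
        received.append(req_id)
        workers.submit(process, req_id, work_seconds)

print("received: ", received)
print("completed:", completed)
```

Reception order is always 0, 1, 2, yet the slow request 0 finishes last, so 
modules behind the listener already see concurrent, reordered processing even 
with one receiving thread per connection.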

>  * CON: multi-epoll does not work with SSL.  It *can't* work with
>    OpenSSL at all, short of adopting a hybrid model where SSL
>    connections use own-thread while others use multi-epoll, which is a
>    bit of a testing nightmare.
> 

Why is it a testing nightmare?  Once the RPC message is received, both 
multi-epoll and own-thread are doing the same thing and handing off to a 
translator that can do a stack wind/unwind to start the message processing, am 
I right?  So the code path unifies at that point.  As stated above, both 
approaches have a similar level of race conditions that might be exposed.  
Shyam is of the opinion that we have already exposed many of them with Ganesha 
and SMB.    

> Obviously I'm not a fan of multi-epoll.  The first point suggests little
> or no benefit.  The second suggests greater risk.  The third is almost
> fatal all by itself, and BTW it was known all along.  Don't we have
> better things to do?

IMHO it's worth it to carefully trade off architectural purity in a few places 
to achieve improved performance.  In summary, to back the own-thread 
alternative I would need to see that a) the own-thread approach is scalable, 
and that b) performance data shows that own-thread is comparable to 
multi-thread-epoll in performance.  Otherwise, in the absence of any other 
candidates, we have to go with multi-thread-epoll.  I don't think we can make 
much progress without multi-threading RPC message reception.  We have already 
reduced the number of system calls needed to receive an RPC as much as we can 
in some cases - this has helped, but it's just not enough (see bz 800892, 
821087). 

Opinions appreciated; now is the time to speak up...

-ben

mtep.pdf
Description: Adobe PDF document

_______________________________________________
Gluster-devel mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
