On 5/9/16 5:21 PM, Robert Collins wrote:
On 10 May 2016 at 10:54, John Dickinson <m...@not.mn> wrote:
On 9 May 2016, at 13:16, Gregory Haynes wrote:

This is a bit of an aside but I am sure others are wondering the same
thing - Is there some info (specs/etherpad/ML thread/etc) that has more
details on the bottleneck you're running in to? Given that the only
clients of your service are the public facing DNS servers I am now even
more surprised that you're hitting a python-inherent bottleneck.

In Swift's case, the summary is that it's hard[0] to write a network
service in Python that shuffles data between the network and a block
device (hard drive) and effectively utilizes all of the hardware
available. So far, we've done very well by fork()'ing child processes,
...
Initial results from reimplementing the object server in Go are very
positive[1]. We're not proposing to rewrite Swift
entirely in Go. Specifically, we're looking at improving object
replication time in Swift. This service must discover what data is on
a drive, talk to other servers in the cluster about what they have,
and coordinate any data sync process that's needed.

[0] Hard, not impossible. Of course, given enough time, we can do
 anything in a Turing-complete language, right? But we're not talking
 about possible, we're talking about efficient tools for the job at
 hand.
...

I'm glad you're finding you can get good results in (presumably)
clean, understandable code.

Given Go's historically poor performance with multiple cores
(https://golang.org/doc/faq#Why_GOMAXPROCS) I'm going to presume the
major advantage is in the CSP programming model - something that
Twisted does very well. Frustratingly, we've had numerous
discussions with folk from the Twisted world who see the pain we have
and want to help, but as a community we've consistently stayed with
eventlet, which has a threaded programming model - and threaded models
are poorly suited to the case here.

At its core, the problem is that filesystem IO can take a surprisingly long time, during which the calling thread/process is blocked, and there's no good asynchronous alternative.

Some background:

With Eventlet, when your greenthread tries to read from a socket that isn't readable, recvfrom() returns -1/EWOULDBLOCK; the Eventlet hub then steps in, unschedules your greenthread, finds an unblocked one, and lets it proceed. It's pretty good at servicing a bunch of concurrent connections and keeping the CPU busy.
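That signal is easy to see with nothing but the stdlib (no eventlet here; an AF_UNIX socketpair stands in for a network connection):

```python
import socket

# A non-blocking socket with nothing to read fails immediately with
# EWOULDBLOCK/EAGAIN -- this is the cue an event hub uses to park the
# current greenthread and run another one.
a, b = socket.socketpair()
a.setblocking(False)

try:
    a.recv(4096)            # nothing to read yet
    result = "readable"
except BlockingIOError:     # errno EWOULDBLOCK/EAGAIN
    result = "EWOULDBLOCK"

b.send(b"hello")            # now make the socket readable
data = a.recv(4096)         # returns in microseconds
a.close()
b.close()
print(result, data)
```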

On the other hand, when the socket is readable, then recvfrom() returns quickly (a few microseconds). The calling process was technically blocked, but the syscall is so fast that it hardly matters.

Now, when your greenthread tries to read from a file, that read() call doesn't return until the data is in your process's memory. This can take a surprisingly long time. If the data isn't in buffer cache and the kernel has to go fetch it from a spinning disk, then you're looking at a seek time of ~7 ms, and that's assuming there are no other pending requests for the disk.

There's no EWOULDBLOCK when reading from a plain file, either. If the file pointer isn't at EOF, then the calling process blocks until the kernel fetches data for it.
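A small sketch of that asymmetry: even opening a regular file with O_NONBLOCK changes nothing, because the flag is silently ignored for plain files.

```python
import os
import tempfile

# O_NONBLOCK has no effect on regular files: read() never raises
# EWOULDBLOCK, so there's no point where an event hub could step in --
# the calling thread just blocks until the kernel has the data.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello disk")
os.close(fd)

fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
data = os.read(fd, 1024)    # blocks as long as the disk takes
os.close(fd)
os.unlink(path)
print(data)
```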

Back to Swift:

The Swift object server basically does two things: it either reads from a disk and writes to a socket or vice versa. There's a little HTTP parsing in there, but the vast majority of the work is shuffling bytes between network and disk. One Swift object server can service many clients simultaneously.

The problem is those pauses due to read(). If your process is servicing hundreds of clients reading from and writing to dozens of disks (in, say, a 48-disk 4U server), then all those little 7 ms waits are pretty bad for throughput. Now, a lot of the time, the kernel does some readahead so your read() calls can quickly return data from buffer cache, but there are still lots of little hitches.

But wait: it gets worse. Sometimes a disk gets slow. Maybe it's got a lot of pending IO requests, maybe its filesystem is getting close to full, or maybe the disk hardware is just starting to get flaky. For whatever reason, IO to this disk starts taking a lot longer than 7 ms on average; think dozens or hundreds of milliseconds. Now, every time your process tries to read from this disk, all other work stops for quite a long time. The net effect is that the object server's throughput plummets while it spends most of its time blocked on IO from that one slow disk.

Now, of course there are things we can do. The obvious one is to use a couple of IO threads per disk and push the blocking syscalls out there... and, in fact, Swift did that. In commit b491549, the object server gained a small threadpool for each disk[1] and started doing its IO there.
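Sketched with the stdlib, the per-disk threadpool approach looks roughly like this (ThreadPoolExecutor standing in for Swift's eventlet-based pools; the disk names and pool size here are invented for illustration):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# One small pool per disk: blocking syscalls run on a pool thread, so
# a slow disk only stalls its own workers, not the whole server.
DISKS = ["sda", "sdb"]
pools = {disk: ThreadPoolExecutor(max_workers=4) for disk in DISKS}

def read_chunk(path, offset, length):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def read_on_disk(disk, path, offset, length):
    # Returns a future; the caller can keep servicing other clients.
    return pools[disk].submit(read_chunk, path, offset, length)

# Tiny demo with a temp file standing in for an object on "sda".
fd, path = tempfile.mkstemp()
os.write(fd, b"object bytes")
os.close(fd)
chunk = read_on_disk("sda", path, 7, 5).result()
os.unlink(path)
print(chunk)
```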

This worked pretty well for avoiding the slow-disk problem. Requests that touched the slow disk would back up, but requests for the other disks in the server would proceed at a normal pace. Good, right?

The problem was all the threadpool overhead. Remember, a significant fraction of the time, write() and read() only touch buffer cache, so the syscalls are very fast. Adding in the threadpool overhead in Python slowed those down. Yes, if you were hit with a 7 ms read penalty, the threadpool saved you, but if you were reading from buffer cache then you just paid a big cost for no gain.
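The asymmetry is easy to demonstrate with a rough, machine-dependent micro-benchmark (stdlib only; the numbers are illustrative, not Swift's):

```python
import os
import tempfile
import timeit
from concurrent.futures import ThreadPoolExecutor

# Once a file is warm in buffer cache, read() returns in microseconds,
# so the pool's submit/result round-trip dominates the cost.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)
os.close(fd)

pool = ThreadPoolExecutor(max_workers=1)

def direct_read():
    with open(path, "rb") as f:
        f.read()

direct_read()  # warm the buffer cache

t_direct = timeit.timeit(direct_read, number=1000)
t_pooled = timeit.timeit(lambda: pool.submit(direct_read).result(),
                         number=1000)
os.unlink(path)
print(t_direct, t_pooled)  # pooled is typically several times slower
```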

On some object-server nodes where the CPUs were already fully utilized, people saw a 25% drop in throughput when using the Python threadpools. It's not worth that performance loss just to gain protection from slow disks.


The second thing Swift tried was to run separate object-server processes for each disk [2]. This also mitigates slow disks, but it avoids the threadpool overhead. The downside here is that dense nodes end up with lots of processes; for example, a 48-disk node with 2 object servers per disk ends up running 96 object-server processes. While these processes aren't particularly RAM-heavy, that's still a decent chunk of memory that could have been holding directories in buffer cache.


Aside: there are a few other things we looked at but rejected. Using Linux AIO (kernel AIO, not POSIX libaio) would let the object server have many pending IOs cheaply, but it only works in O_DIRECT mode, so there's no buffer cache. We also looked at the preadv2() syscall to let us perform buffer-cache-only reads in the main thread and use a blocking read() syscall in a threadpool, but unfortunately preadv2() and pwritev2() only hit Linux in March 2016, so people running such ancient software as Ubuntu Xenial Xerus [3] can't use it.


Now, the Go runtime is really good at making blocking syscalls in dedicated threads. Basically, there are $GOMAXPROCS threads actually running goroutines, and a bunch of syscall threads that are used to make blocking system calls. This lets a single Go object server process have many outstanding IOs on many disks without blocking the whole process. Further, since it's a single process, we can easily get slow-disk mitigation by limiting the number of concurrent requests per disk.
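The per-disk concurrency cap is language-agnostic; for continuity with the rest of this thread, here's the idea sketched in Python (everything here - the names, the limit, the fail-fast timeout - is invented for illustration, not taken from Swift's code):

```python
import threading

# Cap in-flight requests per disk with a bounded semaphore. When a
# disk goes slow, its semaphore fills up and new requests fail fast
# instead of piling up behind it and stalling the whole server.
MAX_IN_FLIGHT = 8
limits = {disk: threading.BoundedSemaphore(MAX_IN_FLIGHT)
          for disk in ("sda", "sdb")}

def with_disk_limit(disk, func):
    sem = limits[disk]
    if not sem.acquire(timeout=0.5):
        raise RuntimeError("disk %s over capacity" % disk)
    try:
        return func()
    finally:
        sem.release()

result = with_disk_limit("sda", lambda: "read ok")
print(result)
```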

It's better than anything we've come up with in Python. It's a single process, freeing up RAM for caching directories; slow-disk mitigation is really easy to build; and all that blocking-syscall stuff is handled by the language runtime.


[1] configurable size, including 0 for those who didn't want it

[2] technically per IP/port pair in the ring, but intended to be used with one port per disk, getting you N servers per disk; see commit df134df

[3] released a whole three weeks ago

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
