On Wed, 2 Jul 2008, Perry E. Metzger wrote:
Seeing is believing. There are lots of good papers out there on concurrency strategies for systems with vast numbers of sockets to manage, and there is no doubt what the answer is -- threads suck compared to events, full stop. Event systems scale linearly for far longer.
Sure, but:
It depends. If you're doing something where there is going to be one socket talking to the system a tiny percentage of the time, why would you bother building an event driven server? If you're building
Or many sockets, but with a task granularity that makes your possibly megaclock/millisecond task switching overhead irrelevant. I'd have to rebuild and rerun lmbench to get an accurate measure of the current context switch time, but milliseconds seems far too long. That sounds like the inverse of the timeslice, not the actual CS time, which I'm pretty sure has been MICROseconds, not milliseconds, since back in the 2.0 kernel on e.g. 200 MHz hardware. My laptop is clocking over 2000 CS's/second sitting nearly idle -- that's just the "noise" of the system's normal interactive single user function, and this when its clock is at its idling 800 MHz (out of 2.5 GHz). On the physics department home directory (NFS) server I'm clocking only 7000-7500 CS/sec at a load average of 0.3 (dual core opteron at 2.8 GHz). Since nfsstat doesn't seem to do rates yet (and has int counters instead of long long uints, grrr) it is a bit difficult to see exactly what this is derived from in real time as far as load goes, but almost all of it seems to be the processing of interrupts, as in the interrupt count and the context switch count go in close parallel. Now, I'm trying to understand the advantage you describe, so bear with me. See what you think of the following: The kernel processes interrupts as efficiently as possible, with upper half and lower half handlers, but either one requires that the CPU stop userspace tasks and load up the kernel's interrupt handler, which requires moving in and out of kernel mode. We aren't going to quibble about a factor of two here and I think that this is well within a factor of two of a context switch time as most of the state of the CPU still has to be saved so I'm calling it a CS even if it is maybe 80% of a CS as far as time goes, depending on how badly one is thrashing the caches. Network based requests are associated with packets from different sources; packets require interrupts to process, interrupts from distinct sources require context switches to process (do they not? -- I'm not really sure here but I recall seeing context switch counts generally rise roughly linearly with interrupt rates even on single threaded network apps) so I would EXPECT the context switch load imposed by a network app to be within a LINEAR factor of two to four independent of whether it was run as a fork or run via events with one exception, described below. I'm estimating the load as packet arrives (I,CS), app gets CPU (CS), select on FD triggers it's "second stage interrupt/packet handler" which computes a result and writes to the network (I,CS), causes the process to block on select, kernel does next task (CS). So I count something like two interrupts and four context switches per network transaction for a single, single threaded network server application handling a single, single packet request with a single, single packet reply. If it has to go to disk, add at least one interrupt/context switch. So it is four or five context switches, some of which may be lighter weight than others, and presumes that the process handling the connection already exists. If the process was created by a fork or as a thread, there is technical stuff about whether it is a kernel thread (expensive) or user thread (much lighter weight, in fact, much more like a procedure call since the forked processes share a memory space and presumably do NOT need to actually save state when switching between them). They still might require a large chunk of a CS time to switch between them because I don't know how the kernel task switcher manages select statements on the entire cluster of children associated with the toplevel parent -- if it sweeps through them without changing context there is one level of overhead, if it fully changes context there is another, but again I think we're talking 10-20% differences. Now, if it is a SINGLE process with umpty open file descriptors and an event loop that uses SOME system call -- and I don't quite understand how a userspace library can do anything but use the same systems calls I could implement in my own code to check for data to be read on a FD, e.g. select or some sort of non-blocking IO poll -- each preexisting, persistent connection (open FD) requires at least 2-3 of the I/CS pairs in order to handle a request. In other words, it saves the CS associated with switching the toplevel task (and permits the kernel to allocate timeslices to its task list with a larger granularity, saving the cost of going in/out of kernel mode to initiate the task/context switch). This saves around two CS out of four or five, a factor of around two improvement. Not that exciting, really. The place where you get really nailed by forks or threads is in their creation. Creating a process is very expensive, a couple of orders of magnitude more expensive than switching processes, and yeah, even order ms. So if you are handling NON-persistent connections -- typical webserver behavior: make a connection, send a request for a file, receive the file requested, break the connection -- handling this with a unique fork or thread per request is absolute suicide. So there a sensible strategy is to pre-initiate enough threads to be able to handle the incoming request stream round-robin, so that as each thread takes a request, processes it, and resumes listening for the next request. This requires the overhead of creating a FD (inevitable for each connection), dealing with the interrupt/CSs required to process the request and deliver the data, then close/free the FD. If the number of daemons required to process connections at incoming saturation is small enough that the overhead associated with processing the task queue doesn't get out of hand, this should scale very nearly as well as event processing, especially if the daemons are a common fork and share an address space. The last question is just how efficiently the kernel processes many blocked processes. Here I don't know the answer, and before looking it up I'll post the question here where probably, somebody does;-) If the connections are PERSISTENT -- e.g. imap connections forked by a mail server for mail clients that connect and stay connected for hours or days -- then as a general rule there will be no I/O waiting on the connections because humans type or click slowly and erratically, mostly a poissonian load distribution. If the kernel has a way of flagging applications that are blocked on a select on an FD without doing an actual context switch into the application, the scheduler can rip through all the blocked tasks without a CS per task, at an overhead rate within a CS or two per ACTIVE task (one where there IS I/O waiting) of the hyperefficient event-driven server that basically stays on CPU except for when the CPU goes back to the kernel anyway to handle the packet stream during interrupts and to advance the timer and so on. I don't KNOW if the kernel manages runnable or blocked at quite this level -- it does seem that there are fields in the task process table that flag it, though, so I'd guess that it does. It seems pretty natural to skip blocked processes without an actual CS in and out just to determine that they are blocked, since many processes spend a lot of time waiting on I/O that can take ms to unblock (e.g. non-cached disk I/O). So I'm not certain that having a large number of idle, blocked processes (waiting on I/O on a FD with a select, for example) is a problem with context switches per se.
something to serve files to 20,000 client machines over persistent TCP connections and the network interface is going to be saturated, hell yes, you should never use 20,000 threads for that, write the thing event driven or you'll die.
Here there are a couple of things. One is that 20K processes MIGHT take 20K context switches just to see if they are blocked on I/O. If they do, then you are definitely dead. 20K processes also at the very least require 20K entries in the kernel process table, and even looping over them in the scheduler to check for an I/O flag with no CS is going to start to take time and eat a rather large block of memory and maybe even thrash the cache. So I absolutely agree, 20K mostly-idle processes on a running system -- even a multicore with lots of memory -- is a bad idea even if they are NOT processing network requests. Fortunately, this is so obvious that I don't think anybody sane would ever try to do this. Second, 20K NON-persistent connections on an e.g. webserver would be absolute insanity, as it adds the thread creation/destruction overhead to the cost of processing the single-message-per-connection interrupts. It just wouldn't work, so people wouldn't do that. IIRC there were a few daemons that did that back in the 80's (when I was managing Suns) and there were rules of thumb on running them. "Don't" is the one I recall, at least if one had more than a handful of hosts connecting. Running 10-20 parallel daemons might work, and people do that -- httpd, nfsd. Running an event driven server daemon (or parallel/network application) would work, and people do that -- pvmd does that, I believe. Which one works the best? I'm perfectly happy to believe that the event driven server could manage roughly twice as many make/break single message connections as a pile of daemons, if the processes aren't bottlenecked somewhere other than at interrupt/context switches. If we assume that at CS takes order of 1-10 usec on a modern system, and it takes a SMALLER amount of time to do the processing associated with a request, then you'll get the advantage. If each request takes (say) order of 100 usec to handle, then you'll be bottlenecked at less than 10,000 requests per second anyway, and I don't think that you'd see any advantage at all, although this depends strongly on whether or not one can block all the daemons somewhere OTHER than the network. The question then is -- what kind of traffic is e.g. an NFS server or a mail server as opposed to a web server? NFS service requires (typically) at least an fstat per file, and may or may not require physical disk access with millisecond scale latencies. Caching reduces this by orders of magnitude, but some patterns of access (especially write access or a nasty mix of many small requests -- latency bound disk accesses) don't necessarily cache well. It is not at all clear, then, that an event driven NFS server would ultimately scale out better than a small pile of NFS daemons as the bottleneck could easily end up being the disk, not the context switch or interrupt burden associated with the network. Mail servers ditto, as they too are basically file servers, but ones for which caching is of little or no advantage. Event driven servers might get you the ability to support as much as a factor of two more connections without dying, but it is more likely that other bottlenecks would kill your performance at about the same number of connections either way. To bring the whole thing around OT again, a very reasonable question is what kind of application one is likely to encounter in parallel computing and which of the three models discussed (forking per connection, forking a pile of daemons to handle connections round robin, single server/daemon handling a table of FDs) is likely to be best. I'd argue that in the case of parallel computing it is ALMOST completely irrelevant -- all three would work well. If one starts up a single e.g. pvmd or lamd, which forks off connected parallel applications on request, then typically there will a) only be roughly 1 such fork per core per system, because the system will run maximally efficiently if it can just keep streaming memory streaming in and out of L1 and L2; b) they will have a long lifetime, so the cost of the fork itself is irrelevant -- a ms out of hours to days of computing; c) internally the applications are already written to be event driven, in the sense that they maintain their own tables of FDs and manage I/O either at the level of the toplevel daemons (who then provide it as streams to the applications) or within the applications themselves via library calls and structures. I THINK PVM is more the former model and MPI the latter, but there are many MPIs. For other associated cluster stuff -- a scheduler daemon, an information daemon such as xmlsysd in wulfstat -- forking vs non-forking for persistent connections (ones likely to last longer than minutes) is likely to be irrelevantly different. Again, pay the ms to create the fork, pay 6 interrupt/context switches instead of 4 or 5 per requested service with a marginal cost of maybe 10 usec, and unless one is absolutely hammering the daemon and the work done by the daemon has absolutely terrible granularity (so it is only DOING order of 10 or 100 usec of work per return) it is pretty ignorable, especially on a system that is PRESUMABLY spending 99% of its time computing and the daemon is basically handling out of band task monitoring or control services.
It is all about the right tool for the job. Apps that are all about massive concurrent communication need events. Apps that are about very little concurrent communication probably don't need them.
Absolutely, but do they need libevents, or do they simply need to be sensibly written to manage a table of fds and selects or nonblocking polls? I've grabbed the source for libevents and am looking through it, but again, it seems to me that it is limited to using the systems calls the kernel provides to handle I/O on open FDs, and if so the main reason to use a library rather than the calls directly is ease of coding, not necessarily efficiency. Usually the code would be more efficient if you did the same thing(s) inline, would it not? The one thing I completely agree with is that one absolutely must remain aware of the high cost of creating and destroying threads/processes. Forking is expensive, and forking to handle a high-volume stream of transient connections is dumb. So dumb that it doesn't work, so nobody does this, I think. At least, not for long.
More the former, not the latter. Event driven programming typically uses registered callbacks that are triggered by a central "Event Loop" when events happen. In such a system, one never blocks for anything -- all activity is performed in callbacks, and one simply returns from a callback if one can't proceed further. The programming paradigm is quite alien to most people.
Fair enough, because most people don't write heavily parallel applications (which includes applications with many parallel I/O streams, not just HPC). But people who do fairly quickly learn to work out the scaling and overhead, do they not, at least "well enough" to achieve some level of performance? Otherwise the applications just fail to work and people don't use them. Evolution in action...;-) This has been a very informative discussion so far, at least for me. Even if my estimates above are all completely out of line and ignore some key thing, all that means is I'll learn even more. The one thing that I wish were written with some sort of internal scheduler/kernel and event mechanism from the beginning is X. It has its own event loop, but event-driven callbacks all block -- there is no internal task parallelism. It is a complete PITA to write an X application that runs a thread continuously but doesn't block the operation of the GUI -- one has to handle state information, use a separate thread, or invert all sorts of things from the normal X paradigm of click and callback. That is, most X apps are written to be serial and X itself is designed to support serial operation, but INTERESTING X apps are parallel, where the UI-linked I/O channels have to be processed "independently" within the X event loop while a separate thread is doing a task loop of interesting work. AFAIK, X only supports its own internal event loop and has horrible kludges to get the illusion of task parallelism unless one just forks a separate thread for the running "work" process and establishes shared state stuctures and so on so that the UI callbacks can safely affect work going on in the work-loop thread without blocking it.
I'd read the libevent man page to get a vague introduction.
There doesn't seem to be one in the source tarball I downloaded. Only event.3 and evdns.3, neither of which are terribly informative. In fact, the documentation sucks. There is more space on the website devoted to pictures of good vs terrible scaling with/without libevent than there is documentation of how it works or how to use it, and of course it is difficult to know if the figures are straw men or fair comparisons. There are a few chunks of sample code in samples. I'll take a look and see what I can see when I have time. I'm working on an X-based GUI-controlled application and do have a forking daemon (xmlsysd) that so far seems to work fine at the level of traffic it was designed for and is likely bottlenecked someplace other than CSs long before they become a problem, but this conversation has convinced me that I could rewrite at least the latter in a way that is more efficient even if I do leave it forking per connection (or using xinetd, as it usually does now:-). It is a monitoring daemon, and is fairly lightweight now because one doesn't want to spend resources watching a cluster spend resources. If I redesigned it along lines suggested by the analysis above, I could permit it to manage many more connections with one part of its work accomplished with roughly constant overhead, where now the overhead associated with that work scales linearly with the number of connections. rgb
-- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 _______________________________________________ Beowulf mailing list, [email protected] To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
