On 16 Jun 2011, at 12:01 AM, Paul Querna wrote:

> I think we have all joked on and off about 3.0 for... well about 8 years now.
>
> I think we are nearing the point we might actually need to be serious about it.
>
> The web is changed.
>
> SPDY is coming down the pipe pretty quickly.
>
> WebSockets might actually be standardized this year.
>
> Two protocols which HTTPD is unable to be good at. Ever.
>
> The problem is our process model, and our module APIs.

I am not convinced.

Over the last three years, I have developed a low level stream serving system that we use to disseminate diagnostic data across datacentres, and one of the basic design decisions was that it had to be lock free and event driven, because above all it needed to be fast. The event driven stuff was done properly, based on religious application of the following rule:

"Thou shalt not attempt any single read or write without the event loop giving you permission to do that single read or write first. Not a single attempt, ever."

From that effort I've learned the following:

- Existing APIs in Unix and Windows really, really suck at non-blocking behaviour. Standard APR file handling couldn't do it, so we couldn't use it properly. DNS libraries are terrible at it: the vast majority of "async" DNS libraries are just hidden threads wrapping attempts at blocking calls, which in turn means unknown resource limits are hit when you least expect it. Database and LDAP calls are blocking. What this means in practice is that you can't link to most of the software out there.

- You cannot block, ever. Think you can cheat and make a cheeky attempt to load that file quickly while nobody is looking? Your hard disk spins down, your network drive is slow for whatever reason, and your entire server stops dead in its tracks. We see this choppy behaviour in poorly written user interface code, and we see the same choppy behaviour in event driven webservers that cheat.

- You have zero room for error. Not a single mistake can be tolerated. Put one foot wrong, and the event loop spins. Put one foot wrong the other way, and the task you were doing evaporates. Finding these problems is painful, and your server is unstable until you do.

- You have to handle every single possible error condition. Every single one. Miss one? You suddenly drop out of an event handler, your event loop spins, or the request becomes abandoned. You have no room for error at all (see the sketch after this list).
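
To show what "handle every single error condition" means in practice, here is a sketch of the read side of such a handler, matching the loop sketched earlier; conn_process and conn_close are made-up placeholder names, not code from our system. Every possible outcome of the single read is mapped to an explicit decision, because any outcome left unhandled is exactly the spinning loop or abandoned request described above.

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    typedef struct conn { int fd; char buf[8192]; } conn_t;

    void conn_process(conn_t *c, ssize_t nbytes);  /* hand the bytes onwards   */
    void conn_close(conn_t *c);                    /* tear the connection down */

    /* Called only once the event loop has granted permission for one read. */
    void on_readable(conn_t *c)
    {
        ssize_t n = read(c->fd, c->buf, sizeof(c->buf));  /* one read, no retry loop */

        if (n > 0) {
            conn_process(c, n);     /* got data: pass it on, then return */
        } else if (n == 0) {
            conn_close(c);          /* peer closed: clean up, don't abandon it */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* spurious wakeup: nothing to do, wait for the next permission */
        } else if (errno == EINTR) {
            /* interrupted before any data arrived: wait for the next permission */
        } else {
            conn_close(c);          /* genuine error: clean up, never spin */
        }
    }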

We have made our event driven code work because it does a small number of simple things, it is designed to do those simple things well, and we need it to be as compact and fast as humanly possible, given that datacentre footprint is our primary constraint.

Our system is like a sportscar: it's fast, but it breaks down if we break the rules. For us that trade-off is worth it; we are prepared to abide by the rules to achieve the speed we need.

Let's contrast this with a web server.

Webservers are traditionally fluid beasts that have been, and continue to be, moulded and shaped by the many, many ever-changing requirements of webmasters. They have been made modular and extensible, and those modules and extensions are written by people of differing programming ability, to different tolerances, within very different budget constraints.

Simply put, webservers need to tolerate error. They need to be built like tractors.

Unreliable code? We have to work despite that. Unhandled error conditions? We have to work despite that. Code that was written in a hurry on a budget? We have to work despite that.

Are we going to be sexy? Of course not. But while the sportscar is broken down at the side of the road, the tractor just keeps going.

Why does our incredibly unsexy architecture help webmasters? Because prefork is bulletproof. Leak, crash, explode, hang: the parent will clean up after us. Whatever we do, within reason, doesn't affect the process next door. If things get really dire, we're delayed for a while, and we recover when the problems pass. Does the server die? Pretty much never. What if we trust our code? Well, worker may help us. Crashes do affect the requests next door, but if they're rare enough we can tolerate it. The event mpm? It isn't a truly event driven mpm; rather, it is more efficient when it comes to keepalives and waiting for connections, because we hand that problem to an event loop that doesn't run anyone else's code within it, so we're still reliable despite the need for a higher standard of code accuracy.
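
To illustrate what "the parent will clean up after us" buys you, here is a stripped-down sketch of the prefork shape. This is not httpd's actual MPM code, just the general pattern: each child serves requests in isolation, and the parent's only job is to reap whatever dies and start a replacement, so a leak or a crash costs one child rather than the server.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NUM_CHILDREN 5

    static void child_main(void)
    {
        /* accept() and serve requests here; a crash only takes out this child */
        for (;;)
            pause();    /* placeholder for the real accept/handle loop */
    }

    static pid_t spawn_child(void)
    {
        pid_t pid = fork();
        if (pid == 0) {      /* child: never returns to the parent's loop */
            child_main();
            _exit(0);
        }
        return pid;          /* parent: -1 on failure, child pid otherwise */
    }

    int main(void)
    {
        for (int i = 0; i < NUM_CHILDREN; i++)
            spawn_child();

        /* The parent's only job: reap whatever dies and replace it. */
        for (;;) {
            int status;
            pid_t dead = waitpid(-1, &status, 0);
            if (dead < 0)
                break;                   /* no children left to wait for */
            fprintf(stderr, "child %d exited, starting a replacement\n", (int)dead);
            spawn_child();
        }
        return 0;
    }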

If you've ever been in a situation where a company demands more speed out of a webserver, wait until you sacrifice reliability to give them that speed. Suddenly they don't care about the speed; reliability becomes the top priority again, as it should be.

So, to get round to my point: if we decide to take a fresh look at the architecture for v3.0, we should be careful to ensure that we don't stop offering a "tractor mode", as this mode is our killer feature. There are enough webservers out there that try to be event driven and sexy, and then fall over on reliability. Alternatively, there are webservers out there that try to be event driven and sexy and succeed, because they keep their feature set modest, keep extensibility to a minimum, and avoid blocking calls to disks and other blocking devices. Great for load balancers, not so great for anything else.

Apache httpd has always had at its heart the ability to be practically extensible while remaining reliable, and I think we should continue to do that.

Regards,
Graham