+1 amen to reliability coming first. We run all kinds of awful code in 
production at the ASF, and httpd's design papers over that elegantly.  Losing 
that would be a terrible blow to the utility of the project.

Sent from my iPhone

On Jun 15, 2011, at 7:33 PM, Graham Leggett <minf...@sharp.fm> wrote:

On 16 Jun 2011, at 12:01 AM, Paul Querna wrote:

I think we have all joked on and off about 3.0 for... well about 8 years now.

I think we are nearing the point we might actually need to be serious about it.

The web is changed.

SPDY is coming down the pipe pretty quickly.

WebSockets might actually be standardized this year.

Two protocols which HTTPD is unable to be good at. Ever.

The problem is our process model, and our module APIs.

I am not convinced.

Over the last three years, I have developed a low level stream serving system 
that we use to disseminate diagnostic data across datacentres, and one of the 
basic design decisions was that it was to be lock free and event driven, 
because above all it needed to be fast. The event driven stuff was done 
properly, based on religious application of the following rule:

"Thou shalt not attempt any single read or write without the event loop giving 
you permission to do that single read or write first. Not a single attempt, 
ever."

From that effort I've learned the following:

- Existing APIs in unix and windows really really suck at non-blocking 
behaviour. Standard APR file handling couldn't do it, so we couldn't use it 
properly. DNS libraries are really terrible at it. The vast majority of "async" 
DNS libraries are just hidden threads which wrap attempts to make blocking 
calls, which in turn means unknown resource limits are hit when you least 
expect it. Database and LDAP calls are blocking. What this means practically is 
that you can't link to most software out there.
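
To make the contrast concrete, here is roughly what the two worlds look like
(a sketch, not code from our system or from httpd): a socket can be put into
non-blocking mode and politely report EAGAIN, while something like
getaddrinfo() offers no such mode and simply parks the calling thread.

    #include <errno.h>
    #include <fcntl.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* The well behaved case: ask the kernel never to block this fd. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        return (flags < 0) ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* A read on such an fd returns -1 with EAGAIN instead of stalling,
     * so the event loop stays in control. */
    ssize_t try_read(int fd, void *buf, size_t len)
    {
        ssize_t got = read(fd, buf, len);
        if (got < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* nothing there yet; the loop will tell us when to try again */
        }
        return got;
    }

    /* The badly behaved case: there is no non-blocking flavour of this
     * call, it blocks for as long as the resolver takes, which is why
     * "async" DNS libraries end up hiding threads around it. */
    int resolve(const char *host, struct addrinfo **out)
    {
        struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
        return getaddrinfo(host, "80", &hints, out);   /* blocks */
    }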

- You cannot block, ever. Think you can cheat and just make a cheeky attempt to 
load that file quickly while nobody is looking? Your hard disk spins down, your 
network drive is slow for whatever reason, and your entire server stops dead in 
its tracks. We see this choppy behaviour in poorly written user interface code, 
and we see the same choppy behaviour in event driven webservers that cheat.
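
The file case is the sneakiest, because poll() and friends will happily report
a regular file as readable even when the disk is asleep. One common way around
it (a sketch with hypothetical names, not how our system or httpd necessarily
handles it) is to let a helper thread take the blocking hit and wake the loop
through a pipe the loop is already polling:

    #include <pthread.h>
    #include <unistd.h>

    struct file_job {
        int file_fd;      /* the file we must never read inside the loop */
        int wakeup_fd;    /* write end of a pipe the event loop polls */
    };

    static void *blocking_reader(void *arg)
    {
        struct file_job *job = arg;
        char buf[65536];
        ssize_t got = read(job->file_fd, buf, sizeof(buf)); /* may stall: fine here */
        /* ...stash buf/got somewhere the loop can find them... */
        char done = (got >= 0) ? 0 : 1;
        (void) write(job->wakeup_fd, &done, 1);  /* tell the loop we're finished */
        return NULL;
    }

    void start_file_read(struct file_job *job)
    {
        pthread_t tid;
        if (pthread_create(&tid, NULL, blocking_reader, job) == 0)
            pthread_detach(tid);
        /* the event loop carries on serving everyone else in the meantime */
    }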

- You have zero room for error. Not a single mistake can be tolerated. Put one 
foot wrong and the event loop spins; put one foot wrong the other way, and the 
task you were doing evaporates. Finding these problems is painful, and your server 
is unstable until you do.

- You have to handle every single possible error condition. Every single one. 
Miss one? You suddenly drop out of an event handler, and your event loop spins, 
or the request becomes abandoned. You have no room for error at all.
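
For a sense of what "every single one" means, here is roughly what even a
trivial non-blocking write ends up looking like once the usual suspects are
handled (a sketch, not httpd code):

    #include <errno.h>
    #include <unistd.h>

    /* Returns bytes written, 0 for "not now", -1 for "this connection is
     * dead".  Every branch matters: miss one and the loop spins, or the
     * request is silently abandoned. */
    ssize_t write_some(int fd, const char *buf, size_t len)
    {
        for (;;) {
            ssize_t sent = write(fd, buf, len);
            if (sent >= 0)
                return sent;           /* beware: may be a partial write */
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;              /* wait for the loop's next permission */
            return -1;                 /* EPIPE, ECONNRESET, ...: give up */
        }
    }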

We have made our event driven code work because it does a number of very simple 
and small things, because it's designed to do those simple and small things 
well, and because we want it to be as compact and fast as humanly possible, 
given that datacentre footprint is our primary constraint.

Our system is like a sportscar: it's fast, but it breaks down if we break the 
rules. For us, that's fine; we are prepared to abide by the rules to achieve 
the speed we need.

Let's contrast this with a web server.

Webservers are traditionally fluid beasts that have been, and continue to be, 
moulded and shaped by many, many ever-changing requirements from webmasters. 
They have been made modular and extensible, and those modules and extensions 
are written by people of different programming ability, to different 
tolerances, within very different budget constraints.

Simply put, webservers need to tolerate error. They need to be built like 
tractors.

Unreliable code? We have to work despite that. Unhandled error conditions? We 
have to work despite that. Code that was written in a hurry on a budget? We 
have to work despite that.

Are we going to be sexy? Of course not. But while the sportscar is broken down 
at the side of the road, the tractor just keeps going.

Why does our incredibly unsexy architecture help webmasters? Because prefork is 
bulletproof. Leak, crash, explode, hang, the parent will clean up after us. 
Whatever we do, within reason, doesn't affect the process next door. If things 
get really dire, we're delayed for a while, and we recover when the problems 
pass. Does the server die? Pretty much never. What if we trust our code? Well, 
worker may help us. Crashes do affect the request next door, but if they're 
rare enough we can tolerate it. The event mpm? It isn't truly an event mpm; 
rather, it is more efficient when it comes to keepalives and waiting for 
connections, because we hand that problem to an event loop that doesn't run 
anyone else's code within it, so we're still reliable despite the loop's need 
for a higher standard of code accuracy.
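
The essence of why prefork is bulletproof fits in a few lines (a heavily
simplified sketch, not the actual MPM): the parent only forks and reaps and
never runs module code itself, so nothing a child does can take the whole
server down.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern void handle_requests(void);  /* runs modules; may leak, crash or hang */

    void supervise(int nchildren)
    {
        for (int i = 0; i < nchildren; i++) {
            if (fork() == 0) {
                handle_requests();
                _exit(0);
            }
        }
        for (;;) {
            int status;
            if (wait(&status) < 0)      /* a child leaked, crashed or was killed */
                continue;
            if (fork() == 0) {          /* quietly replace it */
                handle_requests();
                _exit(0);
            }
        }
    }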

If you've ever been in a situation where a company demands more speed out of a 
webserver, wait until you sacrifice reliability to give them that speed. 
Suddenly they don't care about the speed; reliability becomes top priority 
again, as it should be.

So, to get round to my point: if we decide to revisit the architecture of 
v3.0, we should be careful to ensure that we don't stop offering a "tractor 
mode", as this mode is our killer feature. There are enough webservers out 
there that try to be event driven and sexy, and then fall over on reliability. 
Alternatively, there are webservers out there that try to be event driven and 
sexy, and succeed because they keep their feature set modest, keep 
extensibility to a minimum, and avoid touching blocking calls to disks and 
other blocking devices. Great for load balancers, not so great for anything 
else.

Apache httpd has always had at its heart the ability to be practically 
extensible, while remaining reliable, and I think we should continue to do that.

Regards,
Graham
--

