Shannon -jj Behrens wrote:
On 2/2/07, Ian Bicking [EMAIL PROTECTED] wrote:
Shannon -jj Behrens wrote:
All of this can get really sticky, and I fear there are no good,
general answers. If you do decide to start killing long-running
threads, I do like the idea of letting the programmer explicitly state
that the thread should be long running. Do you really have a problem
of threads hanging? It's just not something I've had a problem with
in general.
Generally no, but occasionally yes, and that's enough to concern me.
Also, there are currently no tools, or even logs, to help someone
figure out what might be causing problems.
The specific project we're working on involves fetching other URLs,
which is something that can block in awkward ways. We have some ideas
to avoid that (probably not performing the subrequests in the request
thread), but even so I would like some additional places where we can
catch problems. Generally when something goes wrong I really don't like
the current behavior, which is that there's no way to notice until the
whole server stops responding, and no resolution except restarting the
server.
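One common mitigation for blocking subrequests is to put a hard timeout on every outbound fetch, so a hung remote server turns into an error you can report instead of a thread you can't get back. A minimal sketch using only the standard library (the function name and error-handling convention here are illustrative, not anything from Paste):

```python
import urllib.request

def fetch_with_timeout(url, timeout=5.0):
    """Fetch a URL, but never block longer than `timeout` seconds.

    Returns (status, body) on success, or (None, error message) on
    failure, so the calling request thread can report the problem
    and move on instead of hanging indefinitely.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except Exception as exc:
        # Surface the failure (timeout, DNS error, refused connection)
        # rather than letting it take the thread down with it.
        return None, str(exc)
```

This doesn't solve the problem of subrequests tying up request threads, but it at least bounds how long each one can block.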
I don't think there's a firm general answer -- in an effort to protect
some requests from other requests, you might instead mess up the entire
machine (e.g., if you let the number of threads simply increase, which I
think is how the non-pooled httpserver would act currently). Or, you
may want to partition requests so that some family of requests is kept
separate from another family (e.g., we'd like to partition along domain
names), but that requires fairly complicated heuristics. On top of
that, bursts of traffic are always fairly likely, and you don't want to
mistake those for actual problems -- that's just what you should expect
to happen.
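The partition-by-family idea could be approximated with a bounded semaphore per domain, so one slow family of requests can exhaust only its own slots rather than the whole thread pool. A sketch under that assumption (the class and its interface are invented for illustration, not part of Paste):

```python
import threading

class DomainLimiter:
    """Cap concurrent requests per domain so one slow family of
    requests cannot consume every worker thread."""

    def __init__(self, per_domain=4):
        self._lock = threading.Lock()
        self._per_domain = per_domain
        self._sems = {}

    def _sem(self, domain):
        # Lazily create one bounded semaphore per domain.
        with self._lock:
            if domain not in self._sems:
                self._sems[domain] = threading.BoundedSemaphore(
                    self._per_domain)
            return self._sems[domain]

    def acquire(self, domain, timeout=None):
        # Returns False instead of blocking forever when the domain is
        # already at its limit -- the caller can treat that as a 503
        # rather than a hang.
        return self._sem(domain).acquire(timeout=timeout)

    def release(self, domain):
        self._sem(domain).release()
```

A request handler would acquire before dispatching and release in a finally block; a False return means that family is saturated and the request should fail fast.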
I'd really like a Paste app to be something you can start up and just
depend on to keep working indefinitely without lots of tending. This
is one of the pieces to make that happen. Actually, I think all that's
needed is:
1. Isolated Python environment (workingenv, virtual-python): without
this an installation can easily be broken by other activity on the
machine.
2. A process supervisor (supervisor2, daemontools): just in case it
segfaults.
3. Exception handling that actively tells you when things are broken.
E.g., if a database goes down the server will still respond, but every
page will give a server error.
4. Of course, application state should never disappear because of a
process restart. In-memory sessions are right out as a result;
everything has to be serializable. That won't always work perfectly
(e.g., when there's a hard restart or a segfault), but doing a proper
restart should never be a problem.
5. Reasonable handling of these thread problems, if they occur.
Alternately a forking (or generally multi-process) server that monitors
its child processes could work. Sadly we don't have an HTTP server that
does that. I'm not sure if flup really monitors its children either, or
just spawns them and expects them to die.
6. Some monitor that checks URL(s) and handles when the URL is gone or
misbehaving. Ideally it could restart the process if the URL is just
gone or not responding (supervisor2 has an XMLRPC API, for instance).
Server errors should probably be handled via notification; restarts
don't (or at least shouldn't) just fix those.
7. In addition to looking for responding URLs, memory leaks (or greedy
CPU usage over a long time) would be good to detect. These are a little
trickier, and need a soft limit (when notification happens) then a hard
limit (when a restart is automatically done). Handling ulimit might be
enough, not sure.
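Points 6 and 7 could share one escalating watchdog: notify on a soft limit of consecutive failures, restart on a hard limit. A sketch of that state machine (the `notify` and `restart` callables are injected placeholders -- in practice `restart` might go through something like supervisor2's XMLRPC API, as mentioned above):

```python
import urllib.request

class UrlWatchdog:
    """Escalating URL monitor: notify after `soft_limit` consecutive
    failures, restart after `hard_limit`, then reset the counter."""

    def __init__(self, soft_limit=2, hard_limit=5,
                 notify=print, restart=lambda: None):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.notify = notify
        self.restart = restart
        self.failures = 0

    def check(self, url, timeout=10):
        """Fetch the URL once and feed the result to record()."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                ok = 200 <= resp.status < 400
        except Exception:
            ok = False
        return self.record(ok)

    def record(self, ok):
        """Track consecutive failures; returns one of
        'ok', 'fail', 'warn', or 'restart'."""
        if ok:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.hard_limit:
            self.notify("hard limit hit, restarting")
            self.restart()
            self.failures = 0
            return "restart"
        if self.failures == self.soft_limit:
            self.notify("soft limit hit")
            return "warn"
        return "fail"
```

A cron job or periodic loop would call `check()` every minute or so; server errors (as opposed to no response) would route through `notify` only, per the point above that restarts shouldn't be expected to fix those.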
Right now we have 1-4. Then we just need 5-7, and to plug them all
together nicely so people can easily deploy the entire combination. The
result should be something as reliable as PHP, and also reliable in
situations when the sysadmin really doesn't want to tend to individual
applications.
I don't have anything really useful to say. By the way, we're using
Nagios to provide *some* assurances that things haven't gone awry.
I've tried Nagios a little bit in the past, but found it rather hard to
set up for the small task I had in mind (just checking some URLs). And
it couldn't do something like restart a service (AFAIK). Still, this is
certainly something that any serious developer should have (be it Nagios
or mon or Big Brother or whatever).
It would be a nice addition to Pylons to add a convention for a pingable
URL in an application. The URL should do little work, but users could
add on to it -- typically if you have a database, you might check you
can connect to the database, for instance, or that critical
directories exist and are writable.
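A pingable URL along those lines could be a tiny WSGI app that runs a list of checks and returns 200 only when all of them pass. A minimal sketch (the `health_app` name and the (name, callable) check convention are invented for illustration; the database and directory checks are the examples from the paragraph above):

```python
def health_app(checks):
    """Build a WSGI app that returns 200 OK if every check callable
    succeeds, or 503 listing the failures otherwise.

    `checks` is a list of (name, callable) pairs; a check signals
    failure by raising any exception.
    """
    def app(environ, start_response):
        failed = []
        for name, check in checks:
            try:
                check()  # e.g. connect to the DB, stat a directory
            except Exception as exc:
                failed.append("%s: %s" % (name, exc))
        if failed:
            body = "\n".join(failed).encode("utf-8")
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain")])
        else:
            body = b"OK"
            start_response("200 OK", [("Content-Type", "text/plain")])
        return [body]
    return app
```

Mounted at a well-known path, this is exactly what an external monitor (Nagios or the watchdog above) would poll; applications extend it just by appending checks to the list.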
I had intended PyPeriodic to be the basis for a URL checker, since the
periodic part felt harder to me than the actual URL fetching, and it
would