Shannon -jj Behrens wrote:
> On 2/2/07, Ian Bicking <[EMAIL PROTECTED]> wrote:
>> Shannon -jj Behrens wrote:
>> > All of this can get really sticky, and I fear there are no good,
>> > general answers.  If you do decide to start killing long-running
>> > threads, I do like the idea of letting the programmer explicitly state
>> > that the thread should be long-running.  Do you really have a problem
>> > with threads hanging?  It's just not something I've had a problem
>> > with in general.
>>
>> Generally no, but occasionally yes, and that's enough to concern me.
>> Also, currently there are no tools, or even logs, to really help
>> someone figure out what might be causing problems.
>>
>> The specific project we're working on involves fetching other URLs,
>> which is something that can block in awkward ways.  We have some ideas
>> to avoid that (probably not performing the subrequests in the request
>> thread), but even so I would like some additional places where we can
>> catch problems.  When something does go wrong, I really don't like
>> the current behavior: there's no way to notice until the whole server
>> stops responding, and no resolution except restarting the server.
>>
>> I don't think there's a firm general answer -- in an effort to protect
>> some requests from other requests, you might instead mess up the entire
>> machine (e.g., if you let the number of threads simply increase, which I
>> think is how the non-pooled httpserver would act currently).  Or, you
>> may want to partition requests so that some family of requests is kept
>> separate from another family (e.g., we'd like to partition along domain
>> names), but that requires fairly complicated heuristics.  On top of
>> that, bursts of traffic are always fairly likely, and you don't want
>> to mistake those for actual problems -- they're just what you should
>> expect to happen.
>>
>> I'd really like a Paste app to be something you can start up and just
>> depend on to keep working indefinitely without lots of tending.  This
>> is one of the pieces to make that happen.  Actually, I think all that's
>> needed is:
>>
>> 1. Isolated Python environment (workingenv, virtual-python): without
>> this an installation can easily be broken by other activity on the 
>> machine.
>>
>> 2. A process supervisor (supervisor2, daemontools): just in case it
>> segfaults.
>>
>> 3. Exception handling that actively tells you when things are broken.
>> E.g., if a database goes down the server will still respond, but
>> every page will give a server error.
>>
>> 4. Of course, application state should never disappear because of a
>> process restart.  In-memory sessions are right out as a result;
>> everything has to be serializable.  That won't always work perfectly
>> (e.g., when there's a hard restart or a segfault), but doing a proper
>> restart should never be a problem.
>>
>> 5. Reasonable handling of these thread problems, if they occur.
>> Alternatively, a forking (or generally multi-process) server that monitors
>> its child processes could work.  Sadly we don't have an HTTP server that
>> does that.  I'm not sure if flup really monitors its children either, or
>> just spawns them and expects them to die.
>>
>> 6. Some monitor that checks URL(s) and handles the case where a URL
>> is gone or misbehaving.  Ideally it could restart the process if the
>> URL is just
>> gone or not responding (supervisor2 has an XMLRPC API, for instance).
>> Server errors should probably be handled via notification; restarts
>> don't (or at least shouldn't) just fix those.
>>
>> 7. In addition to looking for responding URLs, memory leaks (or greedy
>> CPU usage over a long time) would be good to detect.  These are a little
>> trickier, and need a soft limit (at which notification happens) and
>> then a hard limit (at which a restart is automatically done).
>> Handling this with ulimit might be enough; I'm not sure.
>>
>>
>> Right now we have 1-4.  Then we just need 5-7, and to plug them all
>> together nicely so people can easily deploy the entire combination.  The
>> result should be something as reliable as PHP, and also reliable in
>> situations when the sysadmin really doesn't want to tend to individual
>> applications.
> 
> I don't have anything really useful to say.  By the way, we're using
> Nagios to provide *some* assurances that things haven't gone awry.

I've tried Nagios a little bit in the past, but found it rather hard to 
set up for the small task I had in mind (just checking some URLs).  And 
as far as I know it can't do something like restart a service.  Still, 
this is certainly something that any serious developer should have (be 
it Nagios or mon or Big Brother or whatever).
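
To make the restart idea (point 6 above) concrete, here's roughly the 
check I have in mind, talking to supervisor2 over XML-RPC.  This is an 
untested sketch: the URLs, port, and process name are all made up, and 
the XML-RPC method names should be double-checked against the 
supervisor2 docs:

    import socket
    import urllib2
    import xmlrpclib

    socket.setdefaulttimeout(30)   # don't let the checker itself hang

    APP_URL = 'http://localhost:8080/ping'   # made-up pingable URL
    PROCESS_NAME = 'myapp'                   # whatever supervisor2 calls it
    supervisor = xmlrpclib.ServerProxy('http://localhost:9001')

    def check_once():
        try:
            urllib2.urlopen(APP_URL)
        except urllib2.HTTPError, e:
            # The server answered, but with an error (e.g. 500); a
            # restart won't fix that, so just tell somebody.
            notify('Got %s from %s' % (e.code, APP_URL))
        except Exception, e:
            # No response at all: restart the process, then notify.
            supervisor.supervisor.stopProcess(PROCESS_NAME)
            supervisor.supervisor.startProcess(PROCESS_NAME)
            notify('Restarted %s after %r' % (PROCESS_NAME, e))

    def notify(message):
        print message   # stand-in; really this would send mail or log

Run that from cron (or from the checker loop sketched further down) and 
the gone/not-responding case is handled; server errors just turn into 
notifications, per the distinction in point 6.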

It would be a nice addition to Pylons to have a convention for a 
pingable URL in an application.  The URL should do little work itself, 
but users could add on to it -- if you have a database, for instance, 
you might check that you can connect to it, or check that critical 
directories exist and are writable, etc.
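
As a sketch of what I mean -- plain WSGI here, though in Pylons it 
would presumably be a controller action, and the paths and checks are 
just examples:

    import os

    def ping_app(environ, start_response):
        problems = []
        # Example check: are the directories we depend on writable?
        for path in ['/var/myapp/sessions', '/var/myapp/uploads']:
            if not os.access(path, os.W_OK):
                problems.append('not writable: %s' % path)
        # A database check would go here: open a connection, run a
        # trivial query, close it.
        if problems:
            start_response('500 Internal Server Error',
                           [('Content-type', 'text/plain')])
            return ['\n'.join(problems)]
        start_response('200 OK', [('Content-type', 'text/plain')])
        return ['OK']

The point is that the monitor only has to fetch one URL and look at the 
status code; everything application-specific hides behind it.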

I had intended PyPeriodic to be the basis for a URL checker, since the 
periodic part felt harder to me than the actual URL fetching.  It would 
also be nice (and easy) to combine that error reporting with the error 
reporting around other background tasks (for Python tasks that could be 
done fairly easily with paste.exceptions).  But I didn't get very far 
with that.
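
Stripped of the scheduling (which is the part PyPeriodic was meant to 
solve), the checker itself is not much code.  The dumbest possible 
version, with a made-up URL list and interval:

    import socket
    import time
    import traceback
    import urllib2

    socket.setdefaulttimeout(30)   # a hung URL shouldn't hang the checker

    URLS = ['http://localhost:8080/ping']   # whatever should be watched
    INTERVAL = 60                           # seconds between rounds

    def watch():
        while True:
            for url in URLS:
                try:
                    urllib2.urlopen(url)
                except Exception:
                    # paste.exceptions could format and report this more
                    # nicely; plain tracebacks keep the sketch
                    # self-contained.
                    report('%s failed:\n%s' % (url, traceback.format_exc()))
            time.sleep(INTERVAL)

    def report(message):
        print message   # stand-in for mail/log notification

    if __name__ == '__main__':
        watch()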

-- 
Ian Bicking | [EMAIL PROTECTED] | http://blog.ianbicking.org
