Shannon -jj Behrens wrote:
> On 2/2/07, Ian Bicking <[EMAIL PROTECTED]> wrote:
>> Shannon -jj Behrens wrote:
>> > All of this can get really sticky, and I fear there are no good,
>> > general answers.  If you do decide to start killing long-running
>> > threads, I do like the idea of letting the programmer explicitly
>> > state that the thread should be long running.  Do you really have
>> > a problem of threads hanging?  It's just not something I've had a
>> > problem with in general.
>>
>> Generally no, but occasionally yes, and that's enough to concern me.
>> Also, currently there are no tools or even logs to really help
>> someone figure out what might be causing problems.
>>
>> The specific project we're working on involves fetching other URLs,
>> which is something that can block in awkward ways.  We have some
>> ideas to avoid that (probably not performing the subrequests in the
>> request thread), but even so I would like some additional places
>> where we can catch problems.  Generally when something goes wrong I
>> really don't like the current behavior, which is that there's no way
>> to notice until the whole server stops responding, and no resolution
>> except restarting the server.
>>
>> I don't think there's a firm general answer -- in an effort to
>> protect some requests from other requests, you might instead mess up
>> the entire machine (e.g., if you let the number of threads simply
>> increase, which I think is how the non-pooled httpserver would act
>> currently).  Or, you may want to partition requests so that some
>> family of requests is kept separate from another family (e.g., we'd
>> like to partition along domain names), but that's a fairly
>> complicated heuristic.  And along with that, bursts of traffic are
>> always fairly likely, and you don't want to mistake those for actual
>> problems -- that's just what you should expect to happen.
>>
>> I'd really like a Paste app to be something you can start up and
>> just depend on to keep working indefinitely without lots of tending.
>> This is one of the pieces to make that happen.  Actually, I think
>> all that's needed is:
>>
>> 1. An isolated Python environment (workingenv, virtual-python):
>> without this an installation can easily be broken by other activity
>> on the machine.
>>
>> 2. A process supervisor (supervisor2, daemontools): just in case it
>> segfaults.
>>
>> 3. Exception handling that actively tells you when things are
>> broken.  E.g., if a database goes down everything will respond, but
>> every page will give a server error.
>>
>> 4. Of course, application state should never disappear because of a
>> process restart.  In-memory sessions are right out as a result;
>> everything has to be serializable.  That won't always work perfectly
>> (e.g., when there's a hard restart or a segfault), but doing a
>> proper restart should never be a problem.
>>
>> 5. Reasonable handling of these thread problems, if they occur.
>> Alternatively, a forking (or generally multi-process) server that
>> monitors its child processes could work.  Sadly we don't have an
>> HTTP server that does that.  I'm not sure if flup really monitors
>> its children either, or just spawns them and expects them to die.
>>
>> 6. Some monitor that checks URL(s) and handles when the URL is gone
>> or misbehaving.  Ideally it could restart the process if the URL is
>> just gone or not responding (supervisor2 has an XMLRPC API, for
>> instance).
>> Server errors should probably be handled via notification; restarts
>> don't (or at least shouldn't) just fix those.
>>
>> 7. In addition to looking for responding URLs, memory leaks (or
>> greedy CPU usage over a long time) would be good to detect.  These
>> are a little trickier, and need a soft limit (when notification
>> happens) and then a hard limit (when a restart is automatically
>> done).  Handling ulimit might be enough; not sure.
>>
>> Right now we have 1-4.  Then we just need 5-7, and to plug them all
>> together nicely so people can easily deploy the entire combination.
>> The result should be something as reliable as PHP, and also reliable
>> in situations where the sysadmin really doesn't want to tend to
>> individual applications.
>
> I don't have anything really useful to say.  By the way, we're using
> Nagios to provide *some* assurances that things haven't gone awry.
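To make items 6 and 7 above concrete, here's roughly the shape of
monitor I have in mind -- a minimal sketch, not a tested
implementation.  The process name, PID, URL, and limits are all
hypothetical, and I'm assuming supervisor2's XML-RPC interface has
stopProcess/startProcess-style methods (check the actual API before
relying on this):

    import time
    import urllib2
    import xmlrpclib

    SOFT_MEM = 200 * 1024 * 1024  # bytes; notify above this
    HARD_MEM = 400 * 1024 * 1024  # bytes; restart above this

    def memory_usage(pid):
        # Resident set size in bytes; Linux-only, reads /proc.
        for line in open('/proc/%d/status' % pid):
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) * 1024  # value is in kB
        return 0

    def notify(message):
        # Stand-in for real notification (email, logs, whatever).
        print 'MONITOR:', message

    def restart(server, name):
        # Assumed supervisor2 XML-RPC methods, modeled on supervisord.
        server.supervisor.stopProcess(name)
        server.supervisor.startProcess(name)

    def check(url, name, pid,
              supervisor_url='http://localhost:9001/RPC2'):
        server = xmlrpclib.ServerProxy(supervisor_url)
        try:
            urllib2.urlopen(url).read()
        except Exception, e:
            notify('%s not responding (%s); restarting %s'
                   % (url, e, name))
            restart(server, name)
            return
        rss = memory_usage(pid)
        if rss > HARD_MEM:
            notify('%s over hard memory limit; restarting' % name)
            restart(server, name)
        elif rss > SOFT_MEM:
            notify('%s over soft memory limit (%d bytes)' % (name, rss))

    if __name__ == '__main__':
        while True:
            # Hypothetical URL, process name, and PID.
            check('http://localhost:8080/_ping', 'myapp', 12345)
            time.sleep(60)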
I've tried Nagios a little bit in the past, but found it rather hard
to set up for the small task I had in mind (just checking some URLs).
And it couldn't do something like restart a service (AFAIK).  Still,
this is certainly something that any serious developer should have (be
it Nagios or mon or Big Brother or whatever).

It would be a nice addition to Pylons to add a convention for a
pingable URL in an application.  The URL should do little work, but
users could add on to it -- if you have a database, for instance, you
might check that you can connect to it.  Or check that critical
directories exist and are writable, etc.  Something like the first
sketch below is all I have in mind.

I had intended PyPeriodic to be the basis for a URL checker (see the
second sketch below), since the periodic part felt harder to me than
the actual URL fetching, and it would be nice/easy to combine its
error reporting with other error reporting around background tasks
(for Python tasks that could be done with paste.exceptions fairly
easily).  But I didn't get very far with that.
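The pingable URL could be as simple as a tiny WSGI app that runs a set
of registered checks -- a minimal sketch; the check functions and
paths here are just stand-ins for whatever an application considers
critical:

    import os

    class PingApp(object):
        # WSGI app for a pingable health-check URL (mount it at, say,
        # /_ping).  checks is a dict of {name: callable}; a check
        # signals failure by raising an exception.
        def __init__(self, checks):
            self.checks = checks

        def __call__(self, environ, start_response):
            failures = []
            for name, check in self.checks.items():
                try:
                    check()
                except Exception, e:
                    failures.append('%s: %s' % (name, e))
            if failures:
                start_response('500 Internal Server Error',
                               [('Content-Type', 'text/plain')])
                return ['FAIL\n' + '\n'.join(failures) + '\n']
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return ['OK\n']

    def check_writable(path):
        if not os.access(path, os.W_OK):
            raise OSError('%s is not writable' % path)

    # Hypothetical wiring; a real app would add a database check, etc.
    app = PingApp({'data-dir':
                   lambda: check_writable('/var/myapp/data')})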
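And the periodic half of the checker might start out as nothing more
than a loop that survives task failures -- a sketch of the idea, with
report_exception() standing in for paste.exceptions-style reporting:

    import time
    import traceback

    def report_exception(task_name):
        # Stand-in: mail the traceback, hand it to paste.exceptions,
        # log it, etc.
        print 'Task %r failed:' % task_name
        traceback.print_exc()

    def run_periodic(tasks, interval=60):
        # tasks is a dict of {name: callable}; one failing task should
        # never kill the loop or starve the other tasks.
        while True:
            for name, task in tasks.items():
                try:
                    task()
                except Exception:
                    report_exception(name)
            time.sleep(interval)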
--
Ian Bicking | [EMAIL PROTECTED] | http://blog.ianbicking.org