As the maintainer of several dozen tools, I see this happen on a regular basis. There is no automatic notification and no automatic restart. Pitiful, really.
Hedonil has written a set of scripts to run the webservice in a more reliable manner, and even has an "auto-restarter", which I use for some of the tools where the standard webservice used to die on an almost daily basis. Tools Labs should really improve this.

On Tue, Jun 10, 2014 at 10:28 AM, Merlijn van Deen <[email protected]> wrote:

> Hello all,
>
> My 'tsreports' webservice randomly dies every now and then. qacct suggests
> this is due to OOM:
>
> tools.tsreports@tools-login:~$ qacct -j 487745
> qname webgrid-lighttpd
> (...)
> jobname lighttpd-tsreports
> jobnumber 487745
> (...)
> qsub_time Wed Apr 23 08:18:12 2014
> start_time Fri May 23 14:30:17 2014
> end_time Fri Jun 6 10:51:21 2014
> (...)
> failed 0
> exit_status 0
> (...)
> maxvmem 3.973G
>
>
> I have no clue how to debug this, though; the lighttpd error log just shows
>
> 2014-06-06 10:51:20: (mod_fastcgi.c.3061) got proc: pid: 12119 socket:
> unix:/tmp/tsreports-index.fcgi.sock-0 load: 1
> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
> /var/run/lighttpd/tsreports.pid 2 No such file or directory
> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
> /var/run/lighttpd/tsreports.pid 2 No such file or directory
> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
> /var/run/lighttpd/tsreports.pid 2 No such file or directory
> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
> 2014-06-06 10:51:21: (server.c.1502) unlink failed for:
> /var/run/lighttpd/tsreports.pid 2 No such file or directory
> 2014-06-06 10:51:21: (server.c.1512) server stopped by UID = 0 PID = 12087
> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>
> which is not very informative, to say the least.
>
> So: how can one debug these issues?
>
> To add insult to the injury, SGE doesn't even send an e-mail to tell me it
> killed the webserver, nor does it re-start the webserver. Either of those
> would be reasonable (especially the option 'restart the webserver'). Now I
> had to be notified by someone on my talk page...
>
> Merlijn
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
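For what it's worth, the qacct output quoted above is easy to check mechanically. Below is a rough sketch, not anything Tools Labs provides: it assumes the "key value" layout shown in the quoted message, treats size suffixes as binary multiples, and uses a purely hypothetical 4G limit and 95% threshold. The function names (parse_size, parse_qacct, near_limit) are made up for illustration.

```python
import re

# Multipliers for size suffixes such as the 'G' in '3.973G'.
# Assumption: binary units (K = 1024), which may not match every setup.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size string such as '3.973G' into a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, suffix = match.groups()
    return float(number) * _UNITS[suffix]


def parse_qacct(output):
    """Turn 'key value' lines into a dict, skipping blanks and '(...)' lines."""
    fields = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and not parts[0].startswith("("):
            fields[parts[0]] = parts[1].strip()
    return fields


def near_limit(fields, limit="4G", fraction=0.95):
    """True if maxvmem reached at least `fraction` of the assumed limit.

    Both `limit` and `fraction` are hypothetical defaults, not known settings.
    """
    return parse_size(fields["maxvmem"]) >= fraction * parse_size(limit)


# Abbreviated copy of the accounting fields from the quoted message.
SAMPLE = """\
qname webgrid-lighttpd
jobname lighttpd-tsreports
jobnumber 487745
failed 0
exit_status 0
maxvmem 3.973G
"""

if __name__ == "__main__":
    fields = parse_qacct(SAMPLE)
    print(fields["jobname"], near_limit(fields))
```

In practice one would feed it the real output of `qacct -j <jobnumber>` rather than the SAMPLE string; a job whose maxvmem sits just under the limit while 'failed' and 'exit_status' are both 0 would at least hint at a memory kill.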
