Go with hedonil's scripts. They're very good.

Sent from Maximilian's iPhone.
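For anyone who wants to see the general idea behind an auto-restarter, here is a minimal sketch in Python. It is not hedonil's code, which I am not reproducing here. The function name `ensure_running` and its two callback parameters are hypothetical: `list_jobs` stands in for whatever lists your running grid jobs, and `start` for whatever starts the webservice. Only the job name `lighttpd-tsreports` comes from this thread.

```python
from typing import Callable, Iterable


def ensure_running(job_name: str,
                   list_jobs: Callable[[], Iterable[str]],
                   start: Callable[[], None]) -> bool:
    """Start the service if job_name is not among the running jobs.

    list_jobs and start are injected (hypothetical) callbacks, so the
    check itself does not depend on any particular grid command.
    Returns True if a restart was triggered, False otherwise.
    """
    if job_name in set(list_jobs()):
        return False  # job is still alive, nothing to do
    start()
    return True


if __name__ == "__main__":
    # Dry run with fake callbacks: the job list is empty, so a restart fires.
    restarted = ensure_running(
        "lighttpd-tsreports",
        list_jobs=lambda: [],
        start=lambda: print("would start the webservice here"),
    )
    print("restart triggered:", restarted)
```

Run from cron every few minutes, with the callbacks wired to the real job listing and start command, this would cover the "nobody restarted it" half of the problem. It does nothing about the underlying memory use.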
> On Jun 10, 2014, at 06:36, Magnus Manske <[email protected]> wrote:
>
> As the maintainer of several dozen tools, this happens on a regular basis. No
> automatic notification, nor automatic restart. Pitiful, really.
>
> Hedonil has written a set of scripts to run the webservice in a more reliable
> manner, and even has an "auto-restarter", which I use for some of the tools
> where the standard webservice used to die on an almost daily basis.
>
> Tools Labs should really improve this.
>
>
>> On Tue, Jun 10, 2014 at 10:28 AM, Merlijn van Deen <[email protected]>
>> wrote:
>> Hello all,
>>
>> My 'tsreports' webservice randomly dies every now and then. qacct suggests
>> this is due to OOM:
>>
>> tools.tsreports@tools-login:~$ qacct -j 487745
>> qname        webgrid-lighttpd
>> (...)
>> jobname      lighttpd-tsreports
>> jobnumber    487745
>> (...)
>> qsub_time    Wed Apr 23 08:18:12 2014
>> start_time   Fri May 23 14:30:17 2014
>> end_time     Fri Jun  6 10:51:21 2014
>> (...)
>> failed       0
>> exit_status  0
>> (...)
>> maxvmem      3.973G
>>
>>
>> I have no clue how to debug this, though; the lighttpd error log just shows
>>
>> 2014-06-06 10:51:20: (mod_fastcgi.c.3061) got proc: pid: 12119 socket:
>> unix:/tmp/tsreports-index.fcgi.sock-0 load: 1
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
>> /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
>> /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
>> /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:21: (server.c.1502) unlink failed for:
>> /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:21: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>>
>> which is not very informative, to say the least.
>>
>> So: how can one debug these issues?
>>
>> To add insult to the injury, SGE doesn't even send an e-mail to tell me it
>> killed the webserver, nor does it re-start the webserver. Either of those
>> would be reasonable (especially the option 'restart the webserver'). Now I
>> had to be notified by someone on my talk page...
>>
>> Merlijn
>>
>> _______________________________________________
>> Labs-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
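On the debugging question, one small thing that may help is checking `maxvmem` against the job's memory limit automatically rather than eyeballing it. Below is a rough Python sketch that parses the `qacct -j` text quoted above. The helper names (`parse_qacct`, `size_to_bytes`, `looks_like_oom`), the 4 GiB limit and the 0.95 slack factor are all assumptions for illustration; the real limit for the webgrid queue is not stated in this thread. It also assumes the `G` suffix means 1024^3 bytes.

```python
# Multipliers for size suffixes; assumed to be binary (1024-based).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

# Sample taken from the qacct output quoted in this thread.
SAMPLE = """\
qname        webgrid-lighttpd
(...)
jobname      lighttpd-tsreports
jobnumber    487745
(...)
qsub_time    Wed Apr 23 08:18:12 2014
failed       0
exit_status  0
(...)
maxvmem      3.973G
"""


def parse_qacct(text):
    """Turn 'key   value' lines into a dict, skipping single-token lines."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def size_to_bytes(value):
    """Convert strings such as '3.973G' to a number of bytes."""
    value = value.strip()
    if value and value[-1].upper() in UNITS:
        return float(value[:-1]) * UNITS[value[-1].upper()]
    return float(value)


def looks_like_oom(fields, limit_bytes, slack=0.95):
    """True if peak virtual memory came within `slack` of the limit."""
    return size_to_bytes(fields["maxvmem"]) >= slack * limit_bytes


if __name__ == "__main__":
    fields = parse_qacct(SAMPLE)
    # 4 GiB is a hypothetical limit, used only for this example.
    print(fields["jobname"], looks_like_oom(fields, 4 * 1024 ** 3))
```

Note that `failed 0` and `exit_status 0` in the output above do not by themselves indicate a kill, so a peak `maxvmem` close to the limit is only circumstantial evidence of OOM.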
