Hello all,

My 'tsreports' webservice randomly dies every now and then. qacct suggests this is due to OOM:

tools.tsreports@tools-login:~$ qacct -j 487745
qname        webgrid-lighttpd
(...)
jobname      lighttpd-tsreports
jobnumber    487745
(...)
qsub_time    Wed Apr 23 08:18:12 2014
start_time   Fri May 23 14:30:17 2014
end_time     Fri Jun 6 10:51:21 2014
(...)
failed       0
exit_status  0
(...)
maxvmem      3.973G

I have no clue how to debug this, though; the lighttpd error log just shows

2014-06-06 10:51:20: (mod_fastcgi.c.3061) got proc: pid: 12119 socket: unix:/tmp/tsreports-index.fcgi.sock-0 load: 1
2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
2014-06-06 10:51:21: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
2014-06-06 10:51:21: (server.c.1512) server stopped by UID = 0 PID = 12087
2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087

which is not very informative, to say the least. So: how can one debug these issues?

To add insult to injury, SGE doesn't even send an e-mail to tell me it killed the webserver, nor does it restart the webserver. Either of those would be reasonable (especially restarting the webserver). As it was, I had to be notified by someone on my talk page...

Merlijn

_______________________________________________ Labs-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/labs-l
