> This is the first time you have included the log message:
>
> [Tue Sep 01 11:02:27.218311 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' terminated by signal 9
>
> This is an important one.
Yep, user error. "LogLevel info" was set in the vhost configuration but not the main Apache config, and this particular error shows up in the main server error_log rather than the vhost error_log.

> It shows that the process was explicitly killed by something else using the equivalent of 'kill -9'. In other words, it isn't crashing by itself but being killed off.
>
> This is very indicative of the OOM killer. So double check system logs for evidence of it killing stuff.
>
> Also see if you can somehow set up monitoring of memory usage by processes and see if the size of the Apache processes spikes before these events.

Ugh, I hate for a good lead to not pan out. A kill -9 sent to the process certainly sounds like OOM, but I can't find any evidence that this is the problem.

The "good" news is that I now have a mostly repeatable way to generate the error: call a somewhat lengthy process (an operation that takes 30-45 seconds to complete) twice simultaneously. Within 60 seconds (usually much less) the sgm-prod process will crash and generate error_log messages, and both clients will receive 500 Internal Server Errors. This significantly cuts down on how long I have to monitor things from beginning to end.

So here's an example - from server startup to crash in 19 seconds:

[Tue Sep 01 23:38:15.177559 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 23:38:15.183802 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 23:38:15.183862 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Initializing Python.
[Tue Sep 01 23:38:15.202600 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Attach interpreter ''.
[Tue Sep 01 23:38:15.204691 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Adding '/apps/www/sgm/wsgi/SGM' to path.
[Tue Sep 01 23:38:15.204792 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Adding '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
[Tue Sep 01 23:38:15.206152 2020] [wsgi:debug] [pid 54042:tid 140643793323776] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 13 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206211 2020] [wsgi:debug] [pid 54042:tid 140643894036224] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 1 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206233 2020] [wsgi:debug] [pid 54042:tid 140643868858112] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 4 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206252 2020] [wsgi:debug] [pid 54042:tid 140643843680000] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 7 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206270 2020] [wsgi:debug] [pid 54042:tid 140643818501888] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 10 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206311 2020] [wsgi:debug] [pid 54042:tid 140643885643520] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 2 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206350 2020] [wsgi:debug] [pid 54042:tid 140643877250816] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 3 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206388 2020] [wsgi:debug] [pid 54042:tid 140643902428928] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 0 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206602 2020] [wsgi:debug] [pid 54042:tid 140643860465408] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 5 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206652 2020] [wsgi:debug] [pid 54042:tid 140643852072704] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 6 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206702 2020] [wsgi:debug] [pid 54042:tid 140643835287296] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 8 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206752 2020] [wsgi:debug] [pid 54042:tid 140643826894592] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 9 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206799 2020] [wsgi:debug] [pid 54042:tid 140643810109184] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 11 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206965 2020] [wsgi:debug] [pid 54042:tid 140643801716480] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 12 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.207067 2020] [wsgi:debug] [pid 54042:tid 140643784931072] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 14 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:31.260173 2020] [wsgi:info] [pid 54042:tid 140643793323776] [remote 2001:420:1080:3004::ad25:5fd7:37700] mod_wsgi (pid=54042, process='sgm-prod', application=''): Loading Python script file '/apps/www/sgm/wsgi/SGM/run.py'.
[Tue Sep 01 23:38:34.044173 2020] [wsgi:error] [pid 53241:tid 140644082374400] [client 2001:420:1080:3004::ad25:5fd7:37702] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Tue Sep 01 23:38:34.044202 2020] [wsgi:error] [pid 53241:tid 140644124337920] [client 2001:420:1080:3004::ad25:5fd7:37700] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Tue Sep 01 23:38:34.198481 2020] [wsgi:info] [pid 52360:tid 140644297517952] mod_wsgi (pid=54042): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 23:38:34.198534 2020] [wsgi:info] [pid 52360:tid 140644297517952] mod_wsgi (pid=54042): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 23:38:34.198566 2020] [wsgi:info] [pid 52360:tid 140644297517952] mod_wsgi (pid=54042): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 23:38:34.199931 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 23:38:34.204659 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 23:38:34.204707 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Initializing Python.
[Tue Sep 01 23:38:34.222980 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Attach interpreter ''.
[Tue Sep 01 23:38:34.225049 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Adding '/apps/www/sgm/wsgi/SGM' to path.
[Tue Sep 01 23:38:34.225148 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Adding '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
[Tue Sep 01 23:38:34.226427 2020] [wsgi:debug] [pid 54088:tid 140643877250816] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 3 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226486 2020] [wsgi:debug] [pid 54088:tid 140643852072704] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 6 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226511 2020] [wsgi:debug] [pid 54088:tid 140643826894592] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 9 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226528 2020] [wsgi:debug] [pid 54088:tid 140643801716480] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 12 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226546 2020] [wsgi:debug] [pid 54088:tid 140643902428928] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 0 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226669 2020] [wsgi:debug] [pid 54088:tid 140643885643520] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 2 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226715 2020] [wsgi:debug] [pid 54088:tid 140643894036224] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 1 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226743 2020] [wsgi:debug] [pid 54088:tid 140643860465408] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 5 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226767 2020] [wsgi:debug] [pid 54088:tid 140643868858112] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 4 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226791 2020] [wsgi:debug] [pid 54088:tid 140643835287296] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 8 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226814 2020] [wsgi:debug] [pid 54088:tid 140643843680000] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 7 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.227063 2020] [wsgi:debug] [pid 54088:tid 140643810109184] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 11 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.227117 2020] [wsgi:debug] [pid 54088:tid 140643818501888] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 10 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.227936 2020] [wsgi:debug] [pid 54088:tid 140643784931072] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 14 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.228030 2020] [wsgi:debug] [pid 54088:tid 140643793323776] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 13 in daemon process 'sgm-prod'.

And here's "vmstat 1" during that time.
Free RAM is fine, and available cache RAM never drops below 53GB.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 990976 1296072  8044 55966592    0    0    15    53    1    1 12  3 85  0  0
 0  0 990976 1295848  8044 55966592    0    0     0     4  311  398  0  0 100  0  0
 0  0 990976 1295168  8044 55966592    0    0     0     0 4199 2938  2  2 97  0  0
 1  0 990976 1268080  8044 55966592    0    0     0     0 4752 3559  9  4 87  0  0
 1  0 990976 1225368  8044 55966712    0    0     0     0 2311 1228 33  2 65  0  0
 1  0 990976 1139704  8044 55966712    0    0     0     0 2335  860 55  3 42  0  0
 1  0 990976 1082504  8044 55966712    0    0     0    20 2862  626 35  3 62  0  0
 2  0 990976 1013720  8044 55990828    0    0     0     0 2349  370 64  2 34  0  0
 2  0 990976  969116  8044 55990828    0    0     0   186 2288  571 66  2 32  0  0
 2  0 990976  925216  8044 55990828    0    0     0     0 2342  429 66  1 33  0  0
 2  0 990976  917288  8044 55966712    0    0     0     0 3180 1632 43  4 53  0  0
 2  0 990976  882840  8044 55966716    0    0     0    20 3742 2074 32  3 65  0  0
 0  0 990976  875392  8044 55966724    0    0     0    44 8244 5675 18  5 77  0  0
 1  0 990976 1062116  8044 55966744    0    0    32     0 3613 2515 31  3 66  0  0
 0  0 990976 1290828  8044 55966864    0    0     0   320 2072 1016 31  1 68  0  0
 0  0 990976 1290604  8044 55966864    0    0     0   740  731  822  0  1 99  0  0
 0  0 990976 1290352  8044 55966864    0    0     0     0  563  635  0  0 100  0  0
 0  0 990976 1289736  8044 55966864    0    0     0     0  796  894  0  0 100  0  0

"dmesg" produces no output. "/var/log/messages" has no logs during that time period (other than the sudo log of me running the two client programs simultaneously). On a RHEL 8 system there doesn't seem to be any other place that OOM logs would be found.

I'm going to start trying to use "perf" and "strace" to dig into the SIGKILLs themselves and see if I can coax Linux to tell me anything about what's killing off those processes.
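In the meantime, Graham's suggestion of monitoring per-process memory can be scripted with nothing but the Python standard library. Here's a minimal sketch of what I have in mind (the mem_watch.py name and the command-line substring match are just illustrative conventions, and it assumes a Linux /proc layout): it polls VmRSS from /proc/&lt;pid&gt;/status for every matching process once a second, so a spike right before a "terminated by signal 9" event would show up in the output.

```python
"""mem_watch.py - log VmRSS of matching processes once a second.

Illustrative sketch only; the script name and the name-substring
matching are conventions of this example, not anything in mod_wsgi.
"""
import os
import sys
import time


def rss_kb(pid):
    """VmRSS of a pid in kB, read from /proc/<pid>/status; None if gone."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # value is reported in kB
    except OSError:
        return None
    return None


def matching_pids(needle):
    """All pids whose /proc/<pid>/cmdline contains `needle`."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmd = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited while we were scanning
        if needle in cmd:
            pids.append(int(entry))
    return pids


def watch(needle, samples=3, interval=1.0):
    """Print a timestamped RSS line per matching process, `samples` times."""
    for _ in range(samples):
        stamp = time.strftime("%H:%M:%S")
        for pid in matching_pids(needle):
            kb = rss_kb(pid)
            if kb is not None:
                print(f"{stamp} pid={pid} rss={kb} kB", flush=True)
        time.sleep(interval)


if __name__ == "__main__":
    watch(sys.argv[1] if len(sys.argv) > 1 else "httpd")
```

Running something like "python3 mem_watch.py httpd" alongside the two simultaneous client calls should show whether any Apache or daemon process balloons just before the kill.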
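Graham's earlier pointer to the mod_wsgi "extracting Python stack traces" technique can also be approximated with the stdlib faulthandler module. Something like this near the top of run.py should let me dump every thread's stack into the Apache error log on demand; the choice of SIGUSR2 is my own assumption (pick a signal nothing else in the stack uses), and I'm assuming faulthandler's C-level handler sidesteps mod_wsgi's usual restriction on applications registering signal handlers.

```python
# Register an on-demand, all-threads stack dump. stderr ends up in the
# Apache error log for a mod_wsgi daemon process, so this shows where
# each of the 15 threads is sitting when the signal arrives.
import faulthandler
import signal
import sys

# On SIGUSR2, write a traceback for every thread to stderr and keep
# running (the process is not killed by the dump).
faulthandler.register(signal.SIGUSR2, file=sys.stderr, all_threads=True)
```

Then "kill -USR2 &lt;daemon pid&gt;" should write a traceback per thread to the error log without stopping the process, which would show whether specific requests are wedged right before these crashes.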
On Tuesday, September 1, 2020 at 4:41:11 PM UTC-4 Graham Dumpleton wrote:

> This is the first time you have included the log message:
>
> [Tue Sep 01 11:02:27.218311 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' terminated by signal 9
>
> This is an important one.
>
> It shows that the process was explicitly killed by something else using the equivalent of 'kill -9'. In other words, it isn't crashing by itself but being killed off.
>
> This is very indicative of the OOM killer. So double check system logs for evidence of it killing stuff.
>
> Also see if you can somehow set up monitoring of memory usage by processes and see if the size of the Apache processes spikes before these events.
>
> On 2 Sep 2020, at 4:48 am, David White <[email protected]> wrote:
>
> Ugh - another crash, but no crashdump generated.
>
> */apps/www/sgm/logs/error.log (Apache vhost error log):*
> [Tue Sep 01 13:36:51.204652 2020] [wsgi:error] [pid 98990:tid 139766439843584] [client 2001:420:1080:3004::ad25:5fd7:52564] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
> [Tue Sep 01 13:36:51.204662 2020] [wsgi:error] [pid 97668:tid 139767068935936] [client 2001:420:1080:3004::ad25:5fd7:52588] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
> [Tue Sep 01 13:36:51.672700 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Attach interpreter ''.
> [Tue Sep 01 13:36:51.674843 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Adding '/apps/www/sgm/wsgi/SGM' to path.
> [Tue Sep 01 13:36:51.674962 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Adding '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
> [Tue Sep 01 13:36:51.675940 2020] [wsgi:debug] [pid 111735:tid 139767101744896] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 0 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676191 2020] [wsgi:debug] [pid 111735:tid 139767084959488] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 2 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676297 2020] [wsgi:debug] [pid 111735:tid 139767076566784] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 3 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676304 2020] [wsgi:debug] [pid 111735:tid 139767093352192] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 1 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676326 2020] [wsgi:debug] [pid 111735:tid 139767068174080] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 4 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676517 2020] [wsgi:debug] [pid 111735:tid 139767059781376] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 5 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676593 2020] [wsgi:debug] [pid 111735:tid 139767051388672] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 6 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676642 2020] [wsgi:debug] [pid 111735:tid 139767034603264] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 8 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676694 2020] [wsgi:debug] [pid 111735:tid 139767017817856] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 10 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676752 2020] [wsgi:debug] [pid 111735:tid 139767009425152] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 11 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676806 2020] [wsgi:debug] [pid 111735:tid 139767001032448] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 12 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676844 2020] [wsgi:debug] [pid 111735:tid 139767026210560] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 9 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676856 2020] [wsgi:debug] [pid 111735:tid 139766992639744] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 13 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676930 2020] [wsgi:debug] [pid 111735:tid 139766984247040] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 14 in daemon process 'sgm-prod'.
> [Tue Sep 01 13:36:51.676952 2020] [wsgi:debug] [pid 111735:tid 139767042995968] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 7 in daemon process 'sgm-prod'.
>
> */apps/apache2/logs/error.log (Apache system log):*
> [Tue Sep 01 13:36:51.640137 2020] [wsgi:info] [pid 94912:tid 139767496833920] mod_wsgi (pid=97659): Process 'sgm-prod' has died, deregister and restart it.
> [Tue Sep 01 13:36:51.640203 2020] [wsgi:info] [pid 94912:tid 139767496833920] mod_wsgi (pid=97659): Process 'sgm-prod' terminated by signal 9
> [Tue Sep 01 13:36:51.640210 2020] [wsgi:info] [pid 94912:tid 139767496833920] mod_wsgi (pid=97659): Process 'sgm-prod' has been deregistered and will no longer be monitored.
> [Tue Sep 01 13:36:51.641412 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
> [Tue Sep 01 13:36:51.654168 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Python home /apps/vmscan/.virtualenvs/sgm.
> [Tue Sep 01 13:36:51.654251 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Initializing Python.
>
> *ls -la /var/crash:*
> total 4
> drwxrwxrwx.  2 root root    6 Sep  1 13:47 .
> drwxr-xr-x. 26 root root 4096 Jun  3 20:00 ..
>
> On Tuesday, September 1, 2020 at 2:04:30 PM UTC-4 David White wrote:
>
>> This is such a cryptic thing to debug. I can't find any sign of a core dump or segmentation fault in any of the Apache main logs, the Apache vhost logs, or the system logs in /var/log. The main Apache server log (now with "LogLevel debug") shows only the crash event and nothing else particularly helpful. The sgm-prod process can survive for hours, or it may die every 3-4 minutes.
>>
>> [Tue Sep 01 09:00:49.625195 2020] [wsgi:info] [pid 130799:tid 140547110185856] mod_wsgi (pid=130799): Initializing Python.
>> [Tue Sep 01 11:02:27.218264 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' has died, deregister and restart it.
>> [Tue Sep 01 11:02:27.218311 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' terminated by signal 9
>> [Tue Sep 01 11:02:27.218317 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' has been deregistered and will no longer be monitored.
>> [Tue Sep 01 11:02:27.219893 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
>> [Tue Sep 01 11:02:27.230831 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Python home /apps/vmscan/.virtualenvs/sgm.
>> [Tue Sep 01 11:02:27.230893 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Initializing Python.
>>
>> [Tue Sep 01 11:04:44.365899 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' has died, deregister and restart it.
>> [Tue Sep 01 11:04:44.366026 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' terminated by signal 9
>> [Tue Sep 01 11:04:44.366035 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' has been deregistered and will no longer be monitored.
>> [Tue Sep 01 11:04:44.367023 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
>> [Tue Sep 01 11:04:44.374176 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Python home /apps/vmscan/.virtualenvs/sgm.
>> [Tue Sep 01 11:04:44.374232 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Initializing Python.
>>
>> [Tue Sep 01 11:07:29.569996 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' has died, deregister and restart it.
>> [Tue Sep 01 11:07:29.570049 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' terminated by signal 9
>> [Tue Sep 01 11:07:29.570054 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' has been deregistered and will no longer be monitored.
>> [Tue Sep 01 11:07:29.571142 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
>> [Tue Sep 01 11:07:29.577718 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Python home /apps/vmscan/.virtualenvs/sgm.
>> [Tue Sep 01 11:07:29.577776 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Initializing Python.
>>
>> [Tue Sep 01 11:10:18.754372 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' has died, deregister and restart it.
>> [Tue Sep 01 11:10:18.754433 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' terminated by signal 9
>> [Tue Sep 01 11:10:18.754438 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' has been deregistered and will no longer be monitored.
>> [Tue Sep 01 11:10:18.755506 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
>> [Tue Sep 01 11:10:18.762162 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Python home /apps/vmscan/.virtualenvs/sgm.
>> [Tue Sep 01 11:10:18.762215 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Initializing Python.
>>
>> Memory and OOM don't seem to be a factor - this is a beefy virtual machine with 58GB RAM, rarely ever exceeding 5-6 GB even under load. There are no "process killed" messages in dmesg or /var/log/messages.
>>
>> It may be that Apache wasn't properly configured for core dumps. I've made some config changes and did a "kill -SIGSEGV" on both the main httpd process and (after restarting) a daemon process. Both generated crash dumps in /tmp/crash, so I'll wait to see if I get anything useful in there the next time the app crashes for real.
>>
>> You asked what third-party packages are part of the virtual environment, and it's pretty extensive. Flask+SQLAlchemy make up the core of the application, and there are a lot of libraries that either came along for the ride or were manually installed.
>> aniso8601==8.0.0
>> bcrypt==3.2.0
>> blinker==1.4
>> Cerberus==1.3.2
>> certifi==2020.6.20
>> cffi==1.14.2
>> chardet==3.0.4
>> click==7.1.2
>> colorama==0.4.3
>> coverage==5.2.1
>> cx-Oracle==5.3
>> dateparser==0.7.6
>> Eve==1.1.2
>> Eve-SQLAlchemy==0.7.1
>> Events==0.3
>> Flask==1.1.2
>> Flask-Admin==1.5.6
>> Flask-Bcrypt==0.7.1
>> Flask-DebugToolbar==0.11.0
>> flask-ldap3-login==0.9.16
>> Flask-LDAPConn==0.10.1
>> Flask-Login==0.4.1
>> Flask-PyMongo==2.3.0
>> Flask-RESTful==0.3.8
>> Flask-Restless==0.17.0
>> Flask-SQLAlchemy==2.4.4
>> Flask-WTF==0.14.3
>> httpie==2.2.0
>> idna==2.10
>> inflect==4.1.0
>> itsdangerous==1.1.0
>> jdatetime==3.6.2
>> Jinja2==2.11.2
>> jinja2-pluralize==0.3.0
>> jsmin==2.2.2
>> ldap3==2.8
>> lxml==4.5.2
>> MarkupSafe==1.1.1
>> marshmallow==2.10.3
>> mimerender==0.6.0
>> netaddr==0.8.0
>> nose2==0.9.2
>> ordered-set==4.0.2
>> pyasn1==0.4.8
>> pycparser==2.20
>> Pygments==2.6.1
>> pymongo==3.11.0
>> python-dateutil==2.8.1
>> python-mimeparse==1.6.0
>> pytz==2020.1
>> PyYAML==5.3.1
>> redis==3.5.3
>> regex==2020.7.14
>> requests==2.24.0
>> ruamel.yaml==0.16.10
>> ruamel.yaml.clib==0.2.0
>> simplejson==3.17.2
>> six==1.15.0
>> SQLAlchemy==1.3.19
>> sqlalchemy-datatables==2.0.1
>> sqlalchemy-views==0.2.4
>> strict-rfc3339==0.7
>> tzlocal==2.1
>> umalqurra==0.2
>> urllib3==1.25.10
>> webargs==1.4.0
>> Werkzeug==1.0.1
>> WTForms==2.3.3
>> xmljson==0.2.1
>> xmltodict==0.12.0
>>
>> I'll keep fighting with it. Thanks again for the suggestions, Graham.
>>
>> On Monday, August 31, 2020 at 11:48:31 PM UTC-4 Graham Dumpleton wrote:
>>
>>> And another possibility is that your Linux OOM killer is killing the processes because they are using up too much memory.
>>>
>>> https://docs.memset.com/other/linux-s-oom-process-killer
>>>
>>> Apache is quite susceptible to being killed by it because of how it uses resources.
>>>
>>> Check the logs it describes to see if it is being triggered.
>>> On 1 Sep 2020, at 9:18 am, Graham Dumpleton <[email protected]> wrote:
>>>
>>> There must be a core dump or segmentation fault message in the main Apache error logs if the daemon process is crashing.
>>>
>>> One other remote possibility I have seen with some third party libraries (C libraries, not so much Python wrappers) is that when they encounter an error, they call the C library exit(), stopping the process. This results in no crash report.
>>>
>>> What are the third party Python packages you are using?
>>>
>>> Graham
>>>
>>> On 1 Sep 2020, at 7:58 am, David White <[email protected]> wrote:
>>>
>>> *> another cause can be that you are using an external system to trigger log file rotation*
>>>
>>> That is an inspired guess! I am in fact using "cronolog" to manage logfile rotation. I don't *think* it's the culprit, since it rotates only monthly and crashes can happen many times a day (sometimes minutes apart). But removing it can at least narrow down the variables.
>>>
>>> > requests that are getting stuck, and eventually you hit that socket timeout
>>>
>>> It appears the crashes are pretty much immediate, within a second of a request (or multiple requests) coming in. The server may run 10 hours before the next crash, or it may crash again within 5 minutes. There does seem to be some correlation between request load and crash likelihood - crashes almost always happen overnight when multiple cron jobs are hitting the program at once.
>>>
>>> Meanwhile I'll stuff "WSGIRestrictEmbedded On" into the main httpd.conf and see what happens.
>>>
>>> I appreciate the debugging advice.
>>>
>>> On Monday, August 31, 2020 at 5:16:48 PM UTC-4 Graham Dumpleton wrote:
>>>
>>>> On 1 Sep 2020, at 7:06 am, David White <[email protected]> wrote:
>>>>
>>>> Unfortunately the main Apache log shows nothing except normal startup/shutdown messages.
>>>> If the "sgm-prod" threads are being terminated and restarted, they are leaving no indication of that in the main server logs.
>>>>
>>>> The only clue appears to be that the crashing occurs during heavier loads. The application does not often strain the CPU/RAM of the underlying host, but the Apache vhost logs show the crashes seem to occur when multiple requests are being handled nearly simultaneously.
>>>>
>>>> [Mon Aug 31 11:11:07.504787 2020] [wsgi:error] [pid 35980:tid 140546546816768] [client 10.83.210.200:47266] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
>>>> [Mon Aug 31 11:11:07.504806 2020] [wsgi:error] [pid 35980:tid 140546571994880] [client 10.31.82.142:35810] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
>>>>
>>>> Am I likely to see any benefit to either:
>>>>
>>>> 1. Reducing the process or thread count passed to WSGIDaemonProcess, or
>>>> 2. Putting Apache into "prefork" MPM?
>>>>
>>>> Never use prefork MPM if you can avoid it.
>>>>
>>>> Also ensure that outside of VirtualHost you have:
>>>>
>>>> WSGIRestrictEmbedded On
>>>>
>>>> to ensure you are never accidentally running in the main Apache child worker processes.
>>>>
>>>> If there is no evidence of a crash, then another cause can be that you are using an external system to trigger log file rotation, rather than the recommended method of using Apache's own log file rotation.
>>>>
>>>> An external log file rotation system usually only fires once per day, but maybe yours is set up differently based on the size of the logs.
>>>>
>>>> The problem with an external log file rotation system is that it signals Apache to restart.
>>>> Although the main Apache child worker processes are given a grace period to finish requests, the way the Apache libraries implement management of third party processes such as the mod_wsgi daemon processes means they will be killed after 5 seconds if they don't shut down quickly enough. This results in requests being proxied by Apache child worker processes being chopped off, and you see that message.
>>>>
>>>> If it is this though, you should see clear messages at info level in the logs from mod_wsgi that the daemon processes were being shut down and restarted. These will appear some time before you see that message.
>>>>
>>>> The only other thing I can think of is that you have requests that are getting stuck, and eventually you hit that socket timeout, although you have that set very large, so it would take a request to be stuck for 6000 seconds before you saw it.
>>>>
>>>> The only way to tell if it may be that is to use:
>>>>
>>>> https://modwsgi.readthedocs.io/en/develop/user-guides/debugging-techniques.html#extracting-python-stack-traces
>>>>
>>>> so you can signal mod_wsgi to dump Python stack traces periodically and see if a specific request is stuck somewhere.
>>>>
>>>> Graham
>>>>
>>>> Thanks, Graham.
>>>>
>>>> On Monday, August 31, 2020 at 4:56:46 PM UTC-4 Graham Dumpleton wrote:
>>>>
>>>>> Look in the main Apache error log, not the virtual host's, and you will probably find a message about a segmentation fault or other error.
>>>>>
>>>>> Where it is quite random like this, unless you can enable capture of the core dump file and run a debugger (gdb) on it to get a stack trace, it is going to be hard to track down.
>>>>>
>>>>> The only other thing I can suggest is watching the process size to see if it is getting so large that you are running out of memory and a failure to allocate memory causes something to crash.
>>>>> >>>>> Graham >>>>> >>>>> On 1 Sep 2020, at 3:28 am, David White <[email protected]> wrote: >>>>> >>>>> >>>>> Hello. I am running Apache 2.4.43 with mod_wsgi 4.71 compiled against >>>>> Python 3.8.3 (all manually compiled, not part of the RHEL 8.1 distro). >>>>> Apache MPM is "event". >>>>> >>>>> The application running in one of my virtual hosts will occasionally >>>>> crash, but randomly. Repeating the same request immediately will usually >>>>> succeed. The application may continue working for hours before randomly >>>>> crashing again. >>>>> >>>>> This started happening recently with an upgrade of the Flask and >>>>> SQLAlchemy modules. (again, all manually installed via pip) >>>>> >>>>> A crash is reported this way in the vhost's error_log: >>>>> >>>>> [Mon Aug 31 11:11:07.504787 2020] [wsgi:error] [pid 35980:tid >>>>> 140546546816768] [client 10.83.210.200:47266] Truncated or oversized >>>>> response headers received from daemon process 'sgm-prod': >>>>> /apps/www/sgm/wsgi/SGM/run.py >>>>> [Mon Aug 31 11:11:07.504806 2020] [wsgi:error] [pid 35980:tid >>>>> 140546571994880] [client 10.31.82.142:35810] Truncated or oversized >>>>> response headers received from daemon process 'sgm-prod': >>>>> /apps/www/sgm/wsgi/SGM/run.py >>>>> [Mon Aug 31 11:11:08.512262 2020] [wsgi:info] [pid 40601:tid >>>>> 140547110185856] mod_wsgi (pid=40601): Attach interpreter ''. >>>>> [Mon Aug 31 11:11:08.514415 2020] [wsgi:info] [pid 40601:tid >>>>> 140547110185856] mod_wsgi (pid=40601): Adding '/apps/www/sgm/wsgi/SGM' to >>>>> path. >>>>> [Mon Aug 31 11:11:08.514526 2020] [wsgi:info] [pid 40601:tid >>>>> 140547110185856] mod_wsgi (pid=40601): Adding >>>>> '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path. >>>>> [Mon Aug 31 11:11:08.515520 2020] [wsgi:debug] [pid 40601:tid >>>>> 140546715096832] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): >>>>> Started >>>>> thread 0 in daemon process 'sgm-prod'. 
>>>>> [Mon Aug 31 11:11:08.515575 2020] [wsgi:debug] [pid 40601:tid 140546706704128] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 1 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515629 2020] [wsgi:debug] [pid 40601:tid 140546698311424] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 2 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515702 2020] [wsgi:debug] [pid 40601:tid 140546689918720] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 3 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515737 2020] [wsgi:debug] [pid 40601:tid 140546681526016] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 4 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515766 2020] [wsgi:debug] [pid 40601:tid 140546673133312] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 5 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515795 2020] [wsgi:debug] [pid 40601:tid 140546664740608] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 6 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515821 2020] [wsgi:debug] [pid 40601:tid 140546656347904] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 7 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515853 2020] [wsgi:debug] [pid 40601:tid 140546647955200] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 8 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515921 2020] [wsgi:debug] [pid 40601:tid 140546639562496] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 9 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515948 2020] [wsgi:debug] [pid 40601:tid 140546631169792] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 10 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.515972 2020] [wsgi:debug] [pid 40601:tid 140546622777088] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 11 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.516002 2020] [wsgi:debug] [pid 40601:tid 140546614384384] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 12 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.516036 2020] [wsgi:debug] [pid 40601:tid 140546605991680] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 13 in daemon process 'sgm-prod'.
>>>>> [Mon Aug 31 11:11:08.516061 2020] [wsgi:debug] [pid 40601:tid 140546597598976] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 14 in daemon process 'sgm-prod'.
>>>>>
>>>>> The WSGI portion of the vhost's Apache configuration looks like this:
>>>>>
>>>>> WSGIPassAuthorization On
>>>>> LogLevel info wsgi:trace6
>>>>> SetEnv SGM_PRODUCTION 1
>>>>> SetEnv SGM_USE_ORACLE 1
>>>>> WSGIDaemonProcess sgm-prod user=sgmuser group=sgmgroup threads=15 python-home=/apps/sgmuser/.virtualenvs/sgm python-path=/apps/www/sgm/wsgi/SGM:/apps/sgmuser/.virtualenvs/sgm/lib/python3.8/site-packages socket-timeout=6000
>>>>> WSGIScriptAlias / /apps/www/sgm/wsgi/SGM/run.py
>>>>> <Directory /apps/www/sgm/wsgi/SGM>
>>>>>     Require all granted
>>>>>     AllowOverride AuthConfig
>>>>>     WSGIProcessGroup sgm-prod
>>>>>     WSGIApplicationGroup %{GLOBAL}
>>>>> </Directory>
>>>>>
>>>>> The main Apache error log shows nothing relevant, and the
>>>>> application-specific logs (with Python logging messages) show only
>>>>> that the request was made.
>>>>> There is no exception traceback being logged (I'm guessing the app is
>>>>> crashing before that can happen).
>>>>>
>>>>> Given how transient this error is, I'm not sure how else to configure
>>>>> Apache or the SGM app itself to get more details about why the
>>>>> 'sgm-prod' process seems to be crashing.
>>>>>
>>>>> Any help would be appreciated! Thanks.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "modwsgi" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/modwsgi/8df00cf3-abeb-43a8-b14a-6666af322a9an%40googlegroups.com
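[Editor's note] The stack-trace extraction technique Graham links to earlier in the thread can be sketched roughly as follows. This is a generic sketch, not mod_wsgi's own code: a handler installed in the WSGI script (e.g. run.py) dumps every thread's stack to stderr, which mod_wsgi routes to the Apache error log, whenever the daemon process receives SIGUSR1. Note that mod_wsgi restricts `signal.signal()` registrations inside daemon processes by default, so `WSGIRestrictSignal Off` would likely be needed in the server config for this to take effect.

```python
import signal
import sys
import threading
import traceback


def dump_stacks():
    """Format the current stack of every running thread into one string."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %d (%s):\n" % (ident, names.get(ident, "unknown")))
        lines.extend(traceback.format_stack(frame))
    return "".join(lines)


def _dump_handler(signum, frame):
    # Write to stderr so the dump lands in the Apache error log.
    sys.stderr.write(dump_stacks())
    sys.stderr.flush()


# Installed at import time so that `kill -USR1 <daemon pid>` triggers a dump.
signal.signal(signal.SIGUSR1, _dump_handler)
```

A request stuck somewhere (for example inside a SQLAlchemy call) would then show up as the same frame recurring across successive dumps of the 15 daemon threads.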
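[Editor's note] Graham's earlier suggestion to enable capture of a core dump file can be wired up in the main Apache config. A minimal sketch, with an illustrative directory path; on a systemd-managed RHEL host the core-size limit usually has to be raised with `LimitCORE=infinity` in the httpd unit file (or `ulimit -c unlimited` before a manual start), not just the directive:

```apache
# Illustrative path: must exist and be writable by the daemon user (sgmuser).
CoreDumpDirectory /var/tmp/apache-cores
```

A core from the crashing 'sgm-prod' daemon could then be opened with gdb against the httpd binary to get the C-level stack trace at the point of the crash.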

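[Editor's note] The suggestion to monitor memory usage of the daemon processes and watch for a spike before each crash could be scripted along these lines. A minimal sketch, Linux-only (reads /proc directly): matching on the string 'sgm-prod' assumes the daemon's process title includes its name, which typically requires the `display-name` option on `WSGIDaemonProcess` (e.g. `display-name=%{GROUP}`); otherwise the daemons show up under the plain httpd command line.

```python
import os
import time


def rss_kb(pid):
    """Resident set size in kB of a process, read from /proc (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0  # kernel threads have no VmRSS entry


def pids_matching(fragment):
    """PIDs whose command line contains the given fragment."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                if fragment.encode() in f.read():
                    pids.append(int(entry))
        except OSError:
            pass  # process exited, or no permission to read it
    return pids


def monitor(fragment, interval=5.0, samples=None):
    """Print timestamped RSS samples for each matching process.

    samples=None polls forever; pass a number for a finite run.
    """
    n = 0
    while samples is None or n < samples:
        for pid in pids_matching(fragment):
            print(time.strftime("%H:%M:%S"), pid, rss_kb(pid), "kB")
        n += 1
        if samples is None or n < samples:
            time.sleep(interval)
```

Running something like `monitor("sgm-prod")` in a spare terminal while triggering the two simultaneous long requests would show whether the daemon's RSS climbs sharply just before the signal-9 kill, which would point back at the OOM killer.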