This is such a cryptic thing to debug. I can't find any sign of a core dump or segmentation fault in the main Apache logs, the Apache vhost logs, or the system logs in /var/log. The main Apache error log (now with "LogLevel debug") shows only the crash event and nothing else particularly helpful. The sgm-prod process can survive for hours, or it may die every 3-4 minutes.
[Tue Sep 01 09:00:49.625195 2020] [wsgi:info] [pid 130799:tid 140547110185856] mod_wsgi (pid=130799): Initializing Python.
[Tue Sep 01 11:02:27.218264 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 11:02:27.218311 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 11:02:27.218317 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 11:02:27.219893 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:02:27.230831 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:02:27.230893 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Initializing Python.
[Tue Sep 01 11:04:44.365899 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 11:04:44.366026 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 11:04:44.366035 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 11:04:44.367023 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:04:44.374176 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:04:44.374232 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Initializing Python.
[Tue Sep 01 11:07:29.569996 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 11:07:29.570049 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 11:07:29.570054 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 11:07:29.571142 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:07:29.577718 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:07:29.577776 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Initializing Python.
[Tue Sep 01 11:10:18.754372 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 11:10:18.754433 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 11:10:18.754438 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 11:10:18.755506 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:10:18.762162 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:10:18.762215 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Initializing Python.

Memory and OOM don't seem to be a factor - this is a beefy virtual machine with 58 GB RAM, rarely ever exceeding 5-6 GB even under load.
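Worth noting: signal 9 is SIGKILL, which a process can't trap and which produces no core dump, so something external is ending these daemon processes. To find out who, I'm going to try an audit rule on the kill() syscall. This is just a sketch, assuming auditd is installed and running (the key name "catch_sigkill" is a label I made up):

```shell
# Record every kill(2) syscall that delivers signal 9; a1 is the second
# syscall argument, i.e. the signal number. Requires root and auditd.
auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k catch_sigkill

# After the next crash, show the sender's exe, pid and uid:
ausearch -k catch_sigkill -i
```

If nothing turns up there, the SIGKILL is presumably coming from inside the kernel (e.g. the OOM killer), which doesn't go through the kill() syscall.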
No "process killed" messages in dmesg or /var/log/messages. It may be that Apache wasn't properly configured for core dumps. I've made some config changes and did a "kill -SIGSEGV" on both the main httpd process and (after restarting) one of the daemon processes. Both generated crash dumps in /tmp/crash, so I'll wait to see if I get anything useful in there the next time the app crashes for real.

You asked what third-party packages are part of the virtual environment, and it's pretty extensive. Flask and SQLAlchemy make up the core of the application, and there are a lot of libraries that either came along for the ride or were manually installed.

aniso8601==8.0.0
bcrypt==3.2.0
blinker==1.4
Cerberus==1.3.2
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
click==7.1.2
colorama==0.4.3
coverage==5.2.1
cx-Oracle==5.3
dateparser==0.7.6
Eve==1.1.2
Eve-SQLAlchemy==0.7.1
Events==0.3
Flask==1.1.2
Flask-Admin==1.5.6
Flask-Bcrypt==0.7.1
Flask-DebugToolbar==0.11.0
flask-ldap3-login==0.9.16
Flask-LDAPConn==0.10.1
Flask-Login==0.4.1
Flask-PyMongo==2.3.0
Flask-RESTful==0.3.8
Flask-Restless==0.17.0
Flask-SQLAlchemy==2.4.4
Flask-WTF==0.14.3
httpie==2.2.0
idna==2.10
inflect==4.1.0
itsdangerous==1.1.0
jdatetime==3.6.2
Jinja2==2.11.2
jinja2-pluralize==0.3.0
jsmin==2.2.2
ldap3==2.8
lxml==4.5.2
MarkupSafe==1.1.1
marshmallow==2.10.3
mimerender==0.6.0
netaddr==0.8.0
nose2==0.9.2
ordered-set==4.0.2
pyasn1==0.4.8
pycparser==2.20
Pygments==2.6.1
pymongo==3.11.0
python-dateutil==2.8.1
python-mimeparse==1.6.0
pytz==2020.1
PyYAML==5.3.1
redis==3.5.3
regex==2020.7.14
requests==2.24.0
ruamel.yaml==0.16.10
ruamel.yaml.clib==0.2.0
simplejson==3.17.2
six==1.15.0
SQLAlchemy==1.3.19
sqlalchemy-datatables==2.0.1
sqlalchemy-views==0.2.4
strict-rfc3339==0.7
tzlocal==2.1
umalqurra==0.2
urllib3==1.25.10
webargs==1.4.0
Werkzeug==1.0.1
WTForms==2.3.3
xmljson==0.2.1
xmltodict==0.12.0

I'll keep fighting with it. Thanks again for the suggestions, Graham.
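For anyone following along, the core-dump changes amount to roughly the following sketch. The /tmp/crash location is my own choice, CoreDumpDirectory is the standard Apache directive, and the httpd binary path in the gdb example is an assumption about your layout:

```shell
# httpd.conf, outside any VirtualHost:
#   CoreDumpDirectory /tmp/crash
mkdir -p /tmp/crash && chmod 1777 /tmp/crash

# Lift the core-size limit for the httpd service (e.g. via
# "systemctl edit httpd" -> [Service] LimitCORE=infinity), then
# point the kernel at the same directory:
sysctl -w kernel.core_pattern=/tmp/crash/core.%e.%p

# The mod_wsgi daemon processes switch uid, so processes that changed
# credentials must be made dumpable too:
sysctl -w fs.suid_dumpable=2

# When a real core lands, get a backtrace with gdb, e.g.:
#   gdb /usr/local/apache/bin/httpd /tmp/crash/core.httpd.47872
#   (gdb) bt
```

Note that on RHEL 8, systemd-coredump may already own kernel.core_pattern; in that case `coredumpctl list` is the place to look instead of /tmp/crash.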
On Monday, August 31, 2020 at 11:48:31 PM UTC-4 Graham Dumpleton wrote:
> And another possibility is that your Linux OOM killer is killing the
> processes because they are using up too much memory.
>
> https://docs.memset.com/other/linux-s-oom-process-killer
>
> Apache is quite susceptible to being killed by it because of how it uses
> resources.
>
> Check the logs it describes to see if it is being triggered.
>
> On 1 Sep 2020, at 9:18 am, Graham Dumpleton <[email protected]> wrote:
>
> There must be a core dump or segmentation fault message in the main Apache
> error logs if the daemon process is crashing.
>
> One other remote possibility I have seen with some third party libraries (C
> libraries, not so much Python wrappers) is that when they encounter an
> error, they call the C library exit(), stopping the process. This results in no
> crash report.
>
> What are the third party Python packages you are using?
>
> Graham
>
> On 1 Sep 2020, at 7:58 am, David White <[email protected]> wrote:
>
> > another cause can be that you are using an external system to trigger
> > log file rotation
>
> That is an inspired guess! I am in fact using "cronolog" to manage logfile
> rotation. I don't *think* it's the culprit, since it rotates only
> monthly and crashes can happen many times a day (sometimes minutes apart).
> But removing it can at least narrow the variables down.
>
> > requests that are getting stuck, and eventually you hit that socket
> > timeout
>
> It appears the crashes are pretty much immediate, within a second of a
> request (or multiple requests) coming in. The server may run 10 hours
> before the next crash, or it may crash again within 5 minutes. There does
> seem to be some correlation between request load and crashing likelihood -
> crashes almost always happen overnight when multiple cron jobs are hitting
> the program at once.
>
> Meanwhile I'll stuff "WSGIRestrictEmbedded On" into the main httpd.conf
> and see what happens.
>
> I appreciate the debugging advice.
>
> On Monday, August 31, 2020 at 5:16:48 PM UTC-4 Graham Dumpleton wrote:
>
>> On 1 Sep 2020, at 7:06 am, David White <[email protected]> wrote:
>>
>> Unfortunately the main Apache log shows nothing except normal startup/shutdown messages. If the "sgm-prod" threads are being terminated and restarted, they are leaving no indication of that in the main server logs.
>>
>> The only clue appears to be that the crashing occurs under heavier loads. The application does not often strain the CPU/RAM of the underlying host, but the Apache vhost logs show the crashes seem to occur when multiple requests are being handled nearly simultaneously.
>>
>> [Mon Aug 31 11:11:07.504787 2020] [wsgi:error] [pid 35980:tid 140546546816768] [client 10.83.210.200:47266] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
>> [Mon Aug 31 11:11:07.504806 2020] [wsgi:error] [pid 35980:tid 140546571994880] [client 10.31.82.142:35810] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
>>
>> Am I likely to see any benefit to either:
>>
>> 1. Reducing the process or thread count passed to WSGIDaemonProcess, or
>> 2. Putting Apache into "prefork" MPM?
>>
>> Never use prefork MPM if you can avoid it.
>>
>> Also ensure that outside of VirtualHost you have:
>>
>> WSGIRestrictEmbedded On
>>
>> to ensure you are never accidentally running in the main Apache child worker processes.
>>
>> If there is no evidence of a crash, then another cause can be that you are using an external system to trigger log file rotation, rather than the recommended method of using Apache's own log file rotation.
>>
>> An external log file rotation system usually only fires once per day, but maybe yours is set up differently based on the size of the logs.
>>
>> The problem with an external log file rotation system is that it signals Apache to restart. Although the main Apache child worker processes are given a grace period to finish requests, the way the Apache libraries manage third party processes such as the mod_wsgi daemon processes means they will be killed after 5 seconds if they don't shut down quickly enough. This results in requests being proxied by the Apache child worker processes being chopped off, and you see that message.
>>
>> If it is this though, you should see clear messages at info level in the logs from mod_wsgi that the daemon processes were being shut down and restarted. These will appear some time before you see that message.
>>
>> The only other thing I can think of is that you have requests that are getting stuck, and eventually you hit that socket timeout, although you have that set very large, so it would take a request being stuck for 6000 seconds before you saw it.
>>
>> The only way to tell if it may be that is to use:
>>
>> https://modwsgi.readthedocs.io/en/develop/user-guides/debugging-techniques.html#extracting-python-stack-traces
>>
>> so you can signal mod_wsgi to dump Python stack traces periodically and see if a specific request is stuck somewhere.
>>
>> Graham
>>
>> Thanks, Graham.
>>
>> On Monday, August 31, 2020 at 4:56:46 PM UTC-4 Graham Dumpleton wrote:
>>
>>> Look in the main Apache error log, not the virtual host log, and you will probably find a message about a segmentation fault or other error.
>>>
>>> Where it is quite random like this, unless you can enable capture of the core dump file and run a debugger (gdb) on it to get a stack trace, it is going to be hard to track down.
>>>
>>> The only other thing I can suggest is watching the process size to see if it is getting so large that you are running out of memory and a failure to allocate memory causes something to crash.
>>>
>>> Graham
>>>
>>> On 1 Sep 2020, at 3:28 am, David White <[email protected]> wrote:
>>>
>>> Hello. I am running Apache 2.4.43 with mod_wsgi 4.7.1 compiled against Python 3.8.3 (all manually compiled, not part of the RHEL 8.1 distro). The Apache MPM is "event".
>>>
>>> The application running in one of my virtual hosts will occasionally crash, but randomly. Repeating the same request immediately will usually succeed. The application may continue working for hours before randomly crashing again.
>>>
>>> This started happening recently with an upgrade of the Flask and SQLAlchemy modules. (Again, all manually installed via pip.)
>>>
>>> A crash is reported this way in the vhost's error_log:
>>>
>>> [Mon Aug 31 11:11:07.504787 2020] [wsgi:error] [pid 35980:tid 140546546816768] [client 10.83.210.200:47266] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
>>> [Mon Aug 31 11:11:07.504806 2020] [wsgi:error] [pid 35980:tid 140546571994880] [client 10.31.82.142:35810] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
>>> [Mon Aug 31 11:11:08.512262 2020] [wsgi:info] [pid 40601:tid 140547110185856] mod_wsgi (pid=40601): Attach interpreter ''.
>>> [Mon Aug 31 11:11:08.514415 2020] [wsgi:info] [pid 40601:tid 140547110185856] mod_wsgi (pid=40601): Adding '/apps/www/sgm/wsgi/SGM' to path.
>>> [Mon Aug 31 11:11:08.514526 2020] [wsgi:info] [pid 40601:tid 140547110185856] mod_wsgi (pid=40601): Adding '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
>>> [Mon Aug 31 11:11:08.515520 2020] [wsgi:debug] [pid 40601:tid 140546715096832] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 0 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515575 2020] [wsgi:debug] [pid 40601:tid 140546706704128] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 1 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515629 2020] [wsgi:debug] [pid 40601:tid 140546698311424] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 2 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515702 2020] [wsgi:debug] [pid 40601:tid 140546689918720] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 3 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515737 2020] [wsgi:debug] [pid 40601:tid 140546681526016] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 4 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515766 2020] [wsgi:debug] [pid 40601:tid 140546673133312] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 5 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515795 2020] [wsgi:debug] [pid 40601:tid 140546664740608] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 6 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515821 2020] [wsgi:debug] [pid 40601:tid 140546656347904] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 7 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515853 2020] [wsgi:debug] [pid 40601:tid 140546647955200] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 8 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515921 2020] [wsgi:debug] [pid 40601:tid 140546639562496] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 9 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515948 2020] [wsgi:debug] [pid 40601:tid 140546631169792] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 10 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515972 2020] [wsgi:debug] [pid 40601:tid 140546622777088] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 11 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.516002 2020] [wsgi:debug] [pid 40601:tid 140546614384384] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 12 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.516036 2020] [wsgi:debug] [pid 40601:tid 140546605991680] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 13 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.516061 2020] [wsgi:debug] [pid 40601:tid 140546597598976] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 14 in daemon process 'sgm-prod'.
>>>
>>> The WSGI portion of the configuration for the vhost in Apache looks like this:
>>>
>>> WSGIPassAuthorization On
>>> LogLevel info wsgi:trace6
>>> SetEnv SGM_PRODUCTION 1
>>> SetEnv SGM_USE_ORACLE 1
>>> WSGIDaemonProcess sgm-prod user=sgmuser group=sgmgroup threads=15 python-home=/apps/sgmuser/.virtualenvs/sgm python-path=/apps/www/sgm/wsgi/SGM:/apps/sgmuser/.virtualenvs/sgm/lib/python3.8/site-packages socket-timeout=6000
>>> WSGIScriptAlias / /apps/www/sgm/wsgi/SGM/run.py
>>> <Directory /apps/www/sgm/wsgi/SGM>
>>>     Require all granted
>>>     AllowOverride AuthConfig
>>>     WSGIProcessGroup sgm-prod
>>>     WSGIApplicationGroup %{GLOBAL}
>>> </Directory>
>>>
>>> The main Apache error log shows nothing relevant, and the application-specific logs (with Python logging messages) show only that the request was made. There is no exception traceback being logged (I'm guessing the app is crashing before that can happen).
>>>
>>> Given how transitory this error is, I'm not sure how else to configure Apache or the SGM app itself to get more details about why the "sgm-prod" process seems to be crashing.
>>>
>>> Any help would be appreciated! Thanks.
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "modwsgi" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/modwsgi/8df00cf3-abeb-43a8-b14a-6666af322a9an%40googlegroups.com.
