This is such a cryptic thing to debug.  I can't find any sign of a core 
dump or segmentation fault in any of the Apache main logs, the Apache vhost 
logs, or the system logs in /var/log.  The main Apache server log (now 
with "LogLevel debug") shows only the crash event and nothing else 
particularly helpful.  The sgm-prod process can survive for hours, or it 
may die every 3-4 minutes.

[Tue Sep 01 09:00:49.625195 2020] [wsgi:info] [pid 130799:tid 
140547110185856] mod_wsgi (pid=130799): Initializing Python.
[Tue Sep 01 11:02:27.218264 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' has died, 
deregister and restart it.
[Tue Sep 01 11:02:27.218311 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' terminated by 
signal 9
[Tue Sep 01 11:02:27.218317 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' has been 
deregistered and will no longer be monitored.
[Tue Sep 01 11:02:27.219893 2020] [wsgi:info] [pid 47466:tid 
140547110185856] mod_wsgi (pid=47466): Starting process 'sgm-prod' with 
uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:02:27.230831 2020] [wsgi:info] [pid 47466:tid 
140547110185856] mod_wsgi (pid=47466): Python home 
/apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:02:27.230893 2020] [wsgi:info] [pid 47466:tid 
140547110185856] mod_wsgi (pid=47466): Initializing Python.

[Tue Sep 01 11:04:44.365899 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' has died, 
deregister and restart it.
[Tue Sep 01 11:04:44.366026 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' terminated by 
signal 9
[Tue Sep 01 11:04:44.366035 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' has been 
deregistered and will no longer be monitored.
[Tue Sep 01 11:04:44.367023 2020] [wsgi:info] [pid 47872:tid 
140547110185856] mod_wsgi (pid=47872): Starting process 'sgm-prod' with 
uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:04:44.374176 2020] [wsgi:info] [pid 47872:tid 
140547110185856] mod_wsgi (pid=47872): Python home 
/apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:04:44.374232 2020] [wsgi:info] [pid 47872:tid 
140547110185856] mod_wsgi (pid=47872): Initializing Python.

[Tue Sep 01 11:07:29.569996 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' has died, 
deregister and restart it.
[Tue Sep 01 11:07:29.570049 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' terminated by 
signal 9
[Tue Sep 01 11:07:29.570054 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' has been 
deregistered and will no longer be monitored.
[Tue Sep 01 11:07:29.571142 2020] [wsgi:info] [pid 48657:tid 
140547110185856] mod_wsgi (pid=48657): Starting process 'sgm-prod' with 
uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:07:29.577718 2020] [wsgi:info] [pid 48657:tid 
140547110185856] mod_wsgi (pid=48657): Python home 
/apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:07:29.577776 2020] [wsgi:info] [pid 48657:tid 
140547110185856] mod_wsgi (pid=48657): Initializing Python.

[Tue Sep 01 11:10:18.754372 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' has died, 
deregister and restart it.
[Tue Sep 01 11:10:18.754433 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' terminated by 
signal 9
[Tue Sep 01 11:10:18.754438 2020] [wsgi:info] [pid 15161:tid 
140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' has been 
deregistered and will no longer be monitored.
[Tue Sep 01 11:10:18.755506 2020] [wsgi:info] [pid 49225:tid 
140547110185856] mod_wsgi (pid=49225): Starting process 'sgm-prod' with 
uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:10:18.762162 2020] [wsgi:info] [pid 49225:tid 
140547110185856] mod_wsgi (pid=49225): Python home 
/apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:10:18.762215 2020] [wsgi:info] [pid 49225:tid 
140547110185856] mod_wsgi (pid=49225): Initializing Python.
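
The "terminated by signal 9" lines above seem telling: signal 9 is SIGKILL, 
which a process cannot catch, block, or handle, so the dying daemon never 
gets a chance to log a traceback of its own.  The kill has to come from 
outside - the kernel OOM killer, some supervisor, or mod_wsgi's own process 
management (which escalates to SIGKILL when a daemon ignores its shutdown 
signal).  A trivial sanity check of the signal number:

```python
import signal

# Signal 9 is SIGKILL.  Because it cannot be caught, blocked, or
# ignored, the process that receives it can never log anything about
# its own death -- only the sender knows why it was sent.
print(signal.Signals(9).name)  # -> SIGKILL
```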

Memory and OOM don't seem to be a factor - this is a beefy virtual machine 
with 58 GB of RAM, rarely exceeding 5-6 GB even under load.  There are no 
"process killed" messages in dmesg or /var/log/messages.
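
As a belt-and-braces check, a small helper like this can scan dmesg or 
/var/log/messages output for the marker strings the kernel logs when the 
OOM killer fires (the helper and the sample line are illustrative, not 
taken from my actual logs):

```python
def find_oom_events(log_text):
    """Return log lines that suggest the kernel OOM killer fired."""
    markers = ("Out of memory", "invoked oom-killer", "oom-kill")
    return [line for line in log_text.splitlines()
            if any(marker in line for marker in markers)]

# Usage sketch (assumption: /var/log/messages holds the kernel log here):
#   find_oom_events(open("/var/log/messages").read())
sample = "Sep  1 11:02:27 host kernel: Out of memory: Killed process 47466 (httpd)"
print(find_oom_events(sample))
```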

It may be that Apache wasn't properly configured for core dumps.  I've made 
some config changes and sent a "kill -SIGSEGV" to both the main httpd 
process and (after restarting) one of the daemon processes.  Both generated 
crash dumps in /tmp/crash, so I'll wait to see if I get anything useful in 
there the next time the app crashes for real.
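
For the record, the changes boil down to something like the following (a 
sketch, not my exact config, assuming /tmp/crash already exists and is 
writable by the uid the daemon process runs as):

```apache
# httpd.conf: directory where Apache should write core files
CoreDumpDirectory /tmp/crash

# Also needed outside Apache (sketch):
#   ulimit -c unlimited    in the httpd startup environment
#   kernel.core_pattern    must not pipe cores to a handler (e.g. abrt)
```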

You asked what third-party packages are part of the virtual environment, 
and the list is pretty extensive.  Flask and SQLAlchemy make up the core of 
the application, and there are a lot of libraries that either came along 
for the ride or were manually installed.

aniso8601==8.0.0
bcrypt==3.2.0
blinker==1.4
Cerberus==1.3.2
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
click==7.1.2
colorama==0.4.3
coverage==5.2.1
cx-Oracle==5.3
dateparser==0.7.6
Eve==1.1.2
Eve-SQLAlchemy==0.7.1
Events==0.3
Flask==1.1.2
Flask-Admin==1.5.6
Flask-Bcrypt==0.7.1
Flask-DebugToolbar==0.11.0
flask-ldap3-login==0.9.16
Flask-LDAPConn==0.10.1
Flask-Login==0.4.1
Flask-PyMongo==2.3.0
Flask-RESTful==0.3.8
Flask-Restless==0.17.0
Flask-SQLAlchemy==2.4.4
Flask-WTF==0.14.3
httpie==2.2.0
idna==2.10
inflect==4.1.0
itsdangerous==1.1.0
jdatetime==3.6.2
Jinja2==2.11.2
jinja2-pluralize==0.3.0
jsmin==2.2.2
ldap3==2.8
lxml==4.5.2
MarkupSafe==1.1.1
marshmallow==2.10.3
mimerender==0.6.0
netaddr==0.8.0
nose2==0.9.2
ordered-set==4.0.2
pyasn1==0.4.8
pycparser==2.20
Pygments==2.6.1
pymongo==3.11.0
python-dateutil==2.8.1
python-mimeparse==1.6.0
pytz==2020.1
PyYAML==5.3.1
redis==3.5.3
regex==2020.7.14
requests==2.24.0
ruamel.yaml==0.16.10
ruamel.yaml.clib==0.2.0
simplejson==3.17.2
six==1.15.0
SQLAlchemy==1.3.19
sqlalchemy-datatables==2.0.1
sqlalchemy-views==0.2.4
strict-rfc3339==0.7
tzlocal==2.1
umalqurra==0.2
urllib3==1.25.10
webargs==1.4.0
Werkzeug==1.0.1
WTForms==2.3.3
xmljson==0.2.1
xmltodict==0.12.0
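
A few of these packages (cx-Oracle, lxml, cffi, bcrypt) are C extensions, 
which fits the earlier point about native code dying without leaving a 
Python traceback.  One cheap diagnostic is the stdlib faulthandler module 
at the top of run.py: it dumps every thread's Python stack when the process 
takes a fatal signal such as SIGSEGV or SIGABRT.  It can't see SIGKILL or a 
bare C exit(), but it would rule the crashing signals in or out (a sketch - 
the log location is an assumption):

```python
import faulthandler
import os
import tempfile

# Assumption: /tmp/crash is the core-dump directory configured earlier;
# fall back to the system temp dir so this snippet runs anywhere.
crash_dir = "/tmp/crash" if os.path.isdir("/tmp/crash") else tempfile.gettempdir()
fault_log = open(os.path.join(crash_dir, "py-fault.log"), "a")

# On SIGSEGV/SIGFPE/SIGABRT/SIGBUS, write the Python stack of all
# threads to fault_log.  Note this cannot trigger on SIGKILL (signal 9)
# or on a C library that calls exit() directly.
faulthandler.enable(file=fault_log, all_threads=True)
```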

I'll keep fighting with it.  Thanks again for the suggestions, Graham.

On Monday, August 31, 2020 at 11:48:31 PM UTC-4 Graham Dumpleton wrote:

> And another possibility is that your Linux OOM killer is killing the 
> processes because they are using up too much memory.
>
> https://docs.memset.com/other/linux-s-oom-process-killer
>
> Apache is quite susceptible to being killed by it because of how it uses 
> resources.
>
> Check the logs it describes to see if it is being triggered.
>
>
> On 1 Sep 2020, at 9:18 am, Graham Dumpleton <[email protected]> wrote:
>
> There must be a core dump or segmentation fault message in main Apache 
> error logs if the daemon process is crashing.
>
> One other remote possibility I have seen with some third party libraries (C 
> libraries, not so much Python wrappers) is that when they encounter an 
> error, they call the C library exit(), stopping the process. This results in no 
> crash report.
>
> What are the third party Python packages you are using?
>
> Graham
>
> On 1 Sep 2020, at 7:58 am, David White <[email protected]> wrote:
>
> > another cause can be that you are using an external system to trigger 
> > log file rotation
>
> That is an inspired guess!  I am in fact using "cronolog" to manage logfile 
> rotation.  I don't *think* it's the culprit, since it rotates only 
> monthly and crashes can happen many times a day (sometimes minutes apart).  
> But removing it will at least narrow down the variables.  
>
> > requests that are getting stuck, and eventually you hit that socket 
> timeout
>
> It appears the crashes are pretty much immediate, within a second of a 
> request (or multiple requests) coming in.  The server may run 10 hours 
> before the next crash, or it may crash again within 5 minutes.  There does 
> seem to be some correlation between request load and crashing likelihood - 
> crashes almost always happen overnight when multiple cron jobs are hitting 
> the program at once.
>
> Meanwhile I'll stuff "WSGIRestrictEmbedded On" into the main httpd.conf 
> and see what happens.
>
> I appreciate the debugging advice.
>
> On Monday, August 31, 2020 at 5:16:48 PM UTC-4 Graham Dumpleton wrote:
>
>>
>> On 1 Sep 2020, at 7:06 am, David White <[email protected]> wrote:
>>
>> Unfortunately the main Apache log shows nothing except normal 
>> startup/shutdown messages.  If the "sgm-prod" threads are being terminated 
>> and restarted, they are leaving no indication of that in the main server 
>> logs.  
>>
>> The only clue appears to be that the crashing occurs during more heavy 
>> loads.  The application does not often strain the CPU/RAM of the underlying 
>> host, but the Apache vhost logs show the crashes seem to occur when 
>> multiple requests are being handled nearly simultaneously.  
>>
>> [Mon Aug 31 11:11:07.504787 2020] [wsgi:error] [pid 35980:tid 
>> 140546546816768] [client 10.83.210.200:47266] Truncated or oversized 
>> response headers received from daemon process 'sgm-prod': 
>> /apps/www/sgm/wsgi/SGM/run.py
>> [Mon Aug 31 11:11:07.504806 2020] [wsgi:error] [pid 35980:tid 
>> 140546571994880] [client 10.31.82.142:35810] Truncated or oversized 
>> response headers received from daemon process 'sgm-prod': 
>> /apps/www/sgm/wsgi/SGM/run.py 
>>
>> Am I likely to see any benefit to either:
>>
>>    1. Reducing the process or thread count passed to WSGIDaemonProcess, 
>>    or
>>    2. Putting Apache into "prefork" MPM?  
>>    
>> Never use prefork MPM if you can avoid it.
>>
>> Also ensure that outside of VirtualHost you have:
>>
>>    WSGIRestrictEmbedded On
>>
>> to ensure you are never accidentally running in the main Apache child 
>> worker processes.
>>
>> If there is no evidence of a crash, then another cause can be that you 
>> are using an external system to trigger log file rotation, rather than the 
>> recommended method of using Apache's own log file rotation.
>>
>> An external log file rotation system usually only fires once per day, but 
>> maybe yours is set up differently based on size of logs.
>>
>> The problem with an external log file rotation system is that it signals 
>> Apache to restart. Although the main Apache child worker processes are given 
>> a grace period to finish requests, the way the Apache libraries implement 
>> management of third party processes such as the mod_wsgi daemon processes 
>> is that they will kill them after 5 seconds if they don't shut down quickly 
>> enough. This results in requests being proxied by Apache child worker 
>> processes being chopped off, and you see that message.
>>
>> If it is this though, you should see clear messages at info level in the logs 
>> from mod_wsgi that the daemon processes were being shut down and restarted. 
>> These will appear some time before you see that message.
>>
>> The only other thing I can think of is that you have requests that are getting 
>> stuck, and eventually you hit that socket timeout, although you have that 
>> set very large, so a request would have to be stuck for 6000 seconds before 
>> you saw it.
>>
>> The only way to tell if it may be that is to use:
>>
>>
>> https://modwsgi.readthedocs.io/en/develop/user-guides/debugging-techniques.html#extracting-python-stack-traces
>>
>> so you can signal mod_wsgi to dump Python stack traces periodically and see 
>> if a specific request is stuck somewhere.
>>
>> Graham
>>
>> Thanks, Graham.
>>
>> On Monday, August 31, 2020 at 4:56:46 PM UTC-4 Graham Dumpleton wrote:
>>
>>> Look in the main Apache error log, not the virtual host, and you will 
>>> probably find a message about a segmentation fault or other error.
>>>
>>> Where it is quite random like this, unless you can enable capture of the 
>>> core dump file and run a debugger (gdb) on it to get a stack trace, it 
>>> is going to be hard to track down.
>>>
>>> The only other thing I can suggest is watching the process size to see if it 
>>> is getting so large that you are running out of memory and a failure to 
>>> allocate memory causes something to crash.
>>>
>>> Graham
>>>
>>> On 1 Sep 2020, at 3:28 am, David White <[email protected]> wrote:
>>>
>>>
>>> Hello.  I am running Apache 2.4.43 with mod_wsgi 4.7.1 compiled against 
>>> Python 3.8.3 (all manually compiled, not part of the RHEL 8.1 distro).  
>>> Apache MPM is "event".
>>>
>>> The application running in one of my virtual hosts will occasionally 
>>> crash, but randomly.  Repeating the same request immediately will usually 
>>> succeed.  The application may continue working for hours before randomly 
>>> crashing again.
>>>
>>> This started happening recently with an upgrade of the Flask and 
>>> SQLAlchemy modules (again, all manually installed via pip).
>>>
>>> A crash is reported this way in the vhost's error_log:
>>>
>>> [Mon Aug 31 11:11:07.504787 2020] [wsgi:error] [pid 35980:tid 
>>> 140546546816768] [client 10.83.210.200:47266] Truncated or oversized 
>>> response headers received from daemon process 'sgm-prod': 
>>> /apps/www/sgm/wsgi/SGM/run.py
>>> [Mon Aug 31 11:11:07.504806 2020] [wsgi:error] [pid 35980:tid 
>>> 140546571994880] [client 10.31.82.142:35810] Truncated or oversized 
>>> response headers received from daemon process 'sgm-prod': 
>>> /apps/www/sgm/wsgi/SGM/run.py
>>> [Mon Aug 31 11:11:08.512262 2020] [wsgi:info] [pid 40601:tid 
>>> 140547110185856] mod_wsgi (pid=40601): Attach interpreter ''.
>>> [Mon Aug 31 11:11:08.514415 2020] [wsgi:info] [pid 40601:tid 
>>> 140547110185856] mod_wsgi (pid=40601): Adding '/apps/www/sgm/wsgi/SGM' to 
>>> path.
>>> [Mon Aug 31 11:11:08.514526 2020] [wsgi:info] [pid 40601:tid 
>>> 140547110185856] mod_wsgi (pid=40601): Adding 
>>> '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
>>> [Mon Aug 31 11:11:08.515520 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546715096832] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 0 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515575 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546706704128] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 1 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515629 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546698311424] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 2 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515702 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546689918720] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 3 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515737 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546681526016] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 4 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515766 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546673133312] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 5 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515795 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546664740608] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 6 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515821 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546656347904] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 7 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515853 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546647955200] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 8 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515921 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546639562496] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 9 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515948 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546631169792] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 10 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.515972 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546622777088] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 11 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.516002 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546614384384] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 12 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.516036 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546605991680] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 13 in daemon process 'sgm-prod'.
>>> [Mon Aug 31 11:11:08.516061 2020] [wsgi:debug] [pid 40601:tid 
>>> 140546597598976] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started 
>>> thread 14 in daemon process 'sgm-prod'.
>>>
>>> The WSGI portion of the configuration for the vhost in Apache looks like 
>>> this:
>>>
>>> WSGIPassAuthorization On
>>> LogLevel info wsgi:trace6
>>> SetEnv SGM_PRODUCTION  1
>>> SetEnv SGM_USE_ORACLE  1
>>> WSGIDaemonProcess sgm-prod user=sgmuser group=sgmgroup threads=15 \
>>>     python-home=/apps/sgmuser/.virtualenvs/sgm \
>>>     python-path=/apps/www/sgm/wsgi/SGM:/apps/sgmuser/.virtualenvs/sgm/lib/python3.8/site-packages \
>>>     socket-timeout=6000
>>> WSGIScriptAlias / /apps/www/sgm/wsgi/SGM/run.py
>>> <Directory /apps/www/sgm/wsgi/SGM>
>>>     Require all granted
>>>     AllowOverride AuthConfig
>>>     WSGIProcessGroup sgm-prod
>>>     WSGIApplicationGroup %{GLOBAL}
>>> </Directory>
>>>
>>> The main Apache error log shows nothing relevant, and the 
>>> application-specific logs (with Python logging messages) show only that the 
>>> request was made.  There is no exception traceback data being logged (I'm 
>>> guessing the app is crashing before that can happen.)
>>>
>>> Given how transitory this error is, I'm not sure how else to configure 
>>> Apache or the SGM app itself to get more details about why the "sgm-prod" 
>>> process seems to be crashing.
>>>
>>> Any help would be appreciated!  Thanks.
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "modwsgi" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/modwsgi/8df00cf3-abeb-43a8-b14a-6666af322a9an%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/modwsgi/8df00cf3-abeb-43a8-b14a-6666af322a9an%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>>
>>>
