Did you manage to work that out because you finally got stack traces out? If the code was in place properly, then touching the marker file at any time should have resulted in the stack traces being dumped.
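The marker-file technique referred to here boils down to a background thread that watches for a file and dumps every thread's stack when the file appears. A rough, self-contained sketch, with illustrative names (MARKER, INTERVAL) rather than the guide's actual code; the output goes to stdout, which under mod_wsgi would end up in the Apache error log:

```python
import os
import sys
import threading
import time
import traceback

# Illustrative values, not the debugging guide's actual ones.
MARKER = '/tmp/dump-stack-traces'
INTERVAL = 1.0

def dump_stack_traces():
    # sys._current_frames() maps thread id -> that thread's current frame.
    for thread_id, frame in sys._current_frames().items():
        print('Thread %d:' % thread_id)
        print(''.join(traceback.format_stack(frame)))

def monitor():
    # Poll for the marker file; touching it triggers a one-off dump.
    while True:
        if os.path.exists(MARKER):
            os.remove(MARKER)
            dump_stack_traces()
        time.sleep(INTERVAL)

thread = threading.Thread(target=monitor)
thread.daemon = True
thread.start()
```

Placed in the WSGI script file, this runs for the life of each daemon process, so `touch /tmp/dump-stack-traces` at any time should produce a dump.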
BTW, in recent versions of mod_wsgi, in daemon mode you can set a request-timeout option. If you think you have now identified the blocking requests and fixed them, but want a fail-safe, set request-timeout to a value above what you expect the longest requests to run for.

For a single-threaded process, if a request runs longer than that, a graceful process restart is triggered to knock out the blocked request by killing the process. A multithreaded process behaves similarly, except that the restart point is calculated a bit differently because of the multithreading. In that case the request-timeout applies to the average blocked time across all request handler slots. So if you have two threads and one was idle, but the initial request had reached 2*request-timeout, then the average across all handler slots equals request-timeout and the process is kicked out.

The unusual calculation is there to combat the problem of one long-running request exceeding request-timeout and causing other in-flight requests to be interrupted. By taking the average, in a lightly loaded process where there is still spare capacity, a single thread can be allowed to run longer to see if it will still succeed, since there is capacity to handle other requests. If other requests also start to back up, or that one request runs long enough, it pulls the average up and the process finally gets nuked anyway. That seemed a bit fairer than killing a process that still has capacity.

Finally, the point of explaining this is that when request-timeout is used and the process is restarted because of it, mod_wsgi itself will dump out stack traces of what all the Python request handler threads were doing. That way you can see in the normal logs, without doing anything extra, which threads were stuck and caused the limit to be reached. That mechanism can be left enabled permanently.

Graham

> On 19 Mar 2016, at 9:49 AM, [email protected] wrote:
>
> Problem solved.
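The request-timeout averaging described above can be sketched as follows. This is a simplified model for illustration, with names of my own choosing, not mod_wsgi's actual internal code:

```python
def should_restart(blocked_times, request_timeout):
    """Decide whether to trigger a graceful restart, using the
    average blocked time across all request handler slots.

    blocked_times: seconds each handler slot has spent on its
    current request (0 for idle slots).
    """
    average = sum(blocked_times) / len(blocked_times)
    return average >= request_timeout

# Two threads, one idle: a single request must reach
# 2 * request-timeout before the average trips the limit.
assert not should_restart([90, 0], request_timeout=60)   # average 45
assert should_restart([120, 0], request_timeout=60)      # average 60
```

So a lone slow request in an otherwise idle process gets extra headroom, but as more slots back up, the average rises and the restart fires sooner.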
> Can't believe it, but didn't have a hard timeout on one of the backend calls, so it was hanging around forever.
>
> On Friday, March 18, 2016 at 2:53:22 PM UTC-7, [email protected] wrote:
> I placed the monitors inside of the WSGI script and I am not seeing any stack traces. However, when we begin to see timeouts, I can see that Apache begins to start respawning child processes (based on the "Starting stack trace monitor" in the logs). Looking at dmesg, I can see that Apache hits an out-of-memory condition and then kills a process, but I haven't timed those to see if they align with the stack trace messages. What makes this all confusing is that during the flood of workers and timeouts of our scripts, I can still run individual queries and get back responses.
>
> On Wednesday, March 16, 2016 at 9:41:23 AM UTC-7, [email protected] wrote:
> Correct, DataDog. I can add that code into our WSGI scripts in dev and see how it works. Will report back.
>
> On Tuesday, March 15, 2016 at 7:38:59 PM UTC-7, Graham Dumpleton wrote:
> Okay, it is DataDog. Thought it was, but the first charts found on their web site didn't show the legend.
>
>> On 16 Mar 2016, at 1:36 PM, Graham Dumpleton <[email protected]> wrote:
>>
>> What is the monitoring system you are using? The UI looks familiar, but I can't remember what system it is from.
>>
>> How hard would it be for you to add a bit of Python code to the WSGI script file for your application which starts a background thread that reports some extra metrics on a periodic basis?
>>
>> Also, the fact that it appears to be backlogged looks a bit like stuck requests in the Python web application causing an effect in the Apache child worker processes, as shown by your monitoring. The added metric I am thinking of would confirm that.
>>
>> A more brute force way of tracking down whether requests are getting stuck is to add to your WSGI script file the code from:
>>
>> http://modwsgi.readthedocs.org/en/develop/user-guides/debugging-techniques.html#extracting-python-stack-traces
>>
>> That way, when backlogging occurs and the busy worker count increases, you can force logging of what the Python threads in the web application are doing at that point. If threads are stuck, it will tell you where.
>>
>> Graham
>>
>>> On 16 Mar 2016, at 1:21 PM, [email protected] wrote:
>>>
>>> Clarifying on the first line: in our testing, our client is requesting at 3 requests per second. There could be more, but it should not exceed 6.
>>>
>>> The request handlers are waiting on a web request that is spawned to another server which then queries the database. The CPU load is so low it barely crosses 3%, and that's at a high peak. We are typically below 1%.
>>>
>>> The size of the request payload is small and is merely a simple query, though requests can vary in size and range from roughly 3KB to 100KB.
>>>
>>> Attached is a screenshot of our logging that is capturing busy/idle/queries on a timeline. Where the yellow line goes to zero and the workers start to increase is where we begin to see timeouts. The eventual dip after the peak is me bouncing the Apache daemon in order to get it back under some control.
>>>
>>> On Tuesday, March 15, 2016 at 6:35:13 PM UTC-7, Graham Dumpleton wrote:
>>>
>>>> On 16 Mar 2016, at 12:10 PM, [email protected] wrote:
>>>>
>>>> I am hoping to gain some clarity here on our WSGI configuration since a lot of the tuning seems to be heavily reliant on the application itself.
>>>>
>>>> Our setup:
>>>> - Single load balancer (round robin)
>>>> - Two virtual servers with 16GB of RAM
>>>> - Python app ~100MB in memory per process
>>>> - Response times are longer as we broker calls, so it could be up to 1-2 seconds
>>>> - Running mod_wsgi 4.4.2 on Ubuntu 14 LTS with Apache 2
>>>> - WSGI daemon mode running (30 processes with 25 threads)
>>>> - KeepAlives are off
>>>> - WSGIRestrictEmbedded is on
>>>> - Using MPM event
>>>>
>>>> For Apache, we have the following:
>>>> StartServers 30
>>>> MinSpareThreads 40
>>>> MaxSpareThreads 150
>>>> ThreadsPerChild 25
>>>> MaxRequestWorkers 600
>>>>
>>>> I have tried a number of different scenarios, but all of them generally lead to the same problem. We are processing about 3 requests a second with a steady number of worker threads and plenty of idle capacity in place. After a few minutes of sustained traffic, we eventually start timing out, which then leads to worker counts climbing until MaxRequestWorkers is reached. Despite this, I am still able to issue requests and get responses, but it ultimately leads to Apache becoming unresponsive.
>>>
>>> Just to confirm: you say that you never go above 3 requests per second, but that at worst case those requests can take 2 seconds. Correct?
>>>
>>> Are the request handlers predominantly waiting on backend database calls, or are they doing more CPU-intensive work? What is the CPU load on the mod_wsgi daemon processes?
>>>
>>> Also, what is the size of the payloads, for requests and responses?
>>>
>>> Graham
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "modwsgi" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/modwsgi.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>> <Screen Shot 2016-03-15 at 7.19.45 PM.png>
