I placed the monitors inside the WSGI script, but I am not seeing any stack traces. However, when we begin to see timeouts, I can see that Apache starts respawning child processes (based on the "Starting stack trace monitor" messages in the logs). Looking at dmesg, I can see that Apache hits an out-of-memory condition and the kernel then kills a process, but I haven't timed those events to see whether they align with the stack-trace messages. What makes this all confusing is that during the flood of workers and the timeouts of our scripts, I can still run individual queries and get responses back.
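[Editor's note: the stack-trace monitor referenced above follows the pattern from the mod_wsgi debugging-techniques document linked later in this thread. A minimal sketch of that pattern is below; the reporting interval and the use of stderr (which Apache routes to the error log) are illustrative choices, not the exact code from the document.]

```python
import sys
import threading
import time
import traceback

def dump_stacks():
    """Return a formatted traceback for every thread in this process."""
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append("Thread %d:\n" % thread_id)
        lines.extend(traceback.format_stack(frame))
        lines.append("\n")
    return "".join(lines)

def stack_trace_monitor(interval=5.0):
    """Periodically dump all thread stacks while the process is alive."""
    while True:
        time.sleep(interval)
        print(dump_stacks(), file=sys.stderr)

# Start the monitor when the WSGI script file is first imported.
monitor = threading.Thread(target=stack_trace_monitor)
monitor.daemon = True  # never block process shutdown
monitor.start()
```

When requests back up, the periodic dumps show exactly which line each application thread is blocked on.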
On Wednesday, March 16, 2016 at 9:41:23 AM UTC-7, [email protected] wrote:
>
> Correct, DataDog. I can add that code into our WSGI scripts in dev and see
> how it works. Will report back.
>
> On Tuesday, March 15, 2016 at 7:38:59 PM UTC-7, Graham Dumpleton wrote:
>>
>> Okay, it is DataDog. I thought it was, but the first charts I found on
>> their web site didn't show the legend.
>>
>> On 16 Mar 2016, at 1:36 PM, Graham Dumpleton <[email protected]>
>> wrote:
>>
>> What is the monitoring system you are using? The UI looks familiar, but I
>> can't remember what system it is from.
>>
>> How hard would it be for you to add a bit of Python code to the WSGI
>> script file for your application which starts a background thread that
>> reports some extra metrics on a periodic basis?
>>
>> Also, the fact that it appears to be backlogged looks a bit like stuck
>> requests in the Python web application causing a knock-on effect in the
>> Apache child worker processes, as shown by your monitoring. The added
>> metric I am thinking of would confirm that.
>>
>> A more brute-force way of tracking down whether requests are getting
>> stuck is to add the code described here to your WSGI script file:
>>
>> http://modwsgi.readthedocs.org/en/develop/user-guides/debugging-techniques.html#extracting-python-stack-traces
>>
>> That way, when backlogging occurs and the busy worker count increases,
>> you can force logging of what the Python threads in the web application
>> are doing at that point. If threads are stuck, it will tell you where.
>>
>> Graham
>>
>> On 16 Mar 2016, at 1:21 PM, [email protected] wrote:
>>
>> Clarifying the first line - in our testing, our client is issuing 3
>> requests per second. There could be more, but it should not exceed 6.
>>
>> The request handlers are waiting on a web request that is spawned to
>> another server, which then queries the database. The CPU load is so low
>> it barely crosses 3%, and that is at a high peak. We are typically
>> below 1%.
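[Editor's note: the background metrics thread Graham suggests above can be as simple as the sketch below. The metric name and the print-based reporting are placeholders; in practice you would push the values to DataDog (e.g. via a DogStatsD client) instead of printing them.]

```python
import threading
import time

def collect_metrics():
    """Gather per-process numbers worth tracking.

    threading.active_count() is a stand-in here; real candidates would be
    in-flight request counts, queue depth, or per-request timings.
    """
    return {"process.active_threads": threading.active_count()}

def metrics_reporter(interval=10.0):
    """Report metrics on a periodic basis; run as a daemon thread."""
    while True:
        time.sleep(interval)
        for name, value in collect_metrics().items():
            # Placeholder: swap for e.g. statsd.gauge(name, value)
            print("%s=%s" % (name, value))

reporter = threading.Thread(target=metrics_reporter)
reporter.daemon = True
reporter.start()
```

Charting a metric like this alongside the Apache busy/idle worker counts would confirm whether the backlog originates inside the Python application.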
>>
>> The size of the request payload is small - it is merely a simple query -
>> though requests can vary in size, roughly from 3KB to 100KB.
>>
>> Attached is a screenshot of our logging, which captures busy/idle/queries
>> on a timeline. Where the yellow line goes to zero and the workers start
>> to increase is where we begin to see timeouts. The eventual dip after the
>> peak is me bouncing the Apache daemon in order to get it back under some
>> control.
>>
>> On Tuesday, March 15, 2016 at 6:35:13 PM UTC-7, Graham Dumpleton wrote:
>>>
>>> On 16 Mar 2016, at 12:10 PM, [email protected] wrote:
>>>
>>> I am hoping to gain some clarity here on our WSGI configuration, since a
>>> lot of the tuning seems to be heavily reliant on the application itself.
>>>
>>> Our setup:
>>>
>>> - Single load balancer (round robin)
>>> - Two virtual servers with 16GB of RAM
>>> - Python app, ~100MB in memory per process
>>> - Response times are longer as we broker calls, so they can be up to
>>>   1-2 seconds
>>> - Running mod_wsgi 4.4.2 on Ubuntu 14 LTS with Apache 2
>>> - WSGI daemon mode running (30 processes with 25 threads)
>>> - KeepAlives are off
>>> - WSGIRestrictEmbedded is on
>>> - Using MPM event
>>>
>>> For Apache, we have the following:
>>>
>>> - StartServers 30
>>> - MinSpareThreads 40
>>> - MaxSpareThreads 150
>>> - ThreadsPerChild 25
>>> - MaxRequestWorkers 600
>>>
>>> I have tried a number of different scenarios, but all of them generally
>>> lead to the same problem. We are processing about 3 requests a second
>>> with a steady number of worker threads and plenty of idle capacity.
>>> After a few minutes of sustained traffic, we eventually start timing
>>> out, which then drives worker counts up until they reach
>>> MaxRequestWorkers. Despite this, I am still able to issue requests and
>>> get responses, but it ultimately leads to Apache becoming unresponsive.
>>>
>>>
>>> Just to confirm.
>>> You say that you never go above 3 requests per second, but that in the
>>> worst case those requests can take 2 seconds. Correct?
>>>
>>> Are the request handlers predominantly waiting on backend database
>>> calls, or are they doing more CPU-intensive work? What is the CPU load
>>> on the mod_wsgi daemon processes?
>>>
>>> Also, what is the size of the payloads, for requests and responses?
>>>
>>> Graham
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "modwsgi" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/modwsgi.
>> For more options, visit https://groups.google.com/d/optout.
>> <Screen Shot 2016-03-15 at 7.19.45 PM.png>
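[Editor's note: as an addendum for readers reconstructing the configuration described up-thread, the Apache/mod_wsgi directives would look roughly like the fragment below. The daemon process group name and script path are illustrative; the numeric values are taken from the original post.]

```apache
# MPM event settings as stated in the thread
StartServers            30
MinSpareThreads         40
MaxSpareThreads        150
ThreadsPerChild         25
MaxRequestWorkers      600

KeepAlive Off

# Keep the application out of the Apache child processes
WSGIRestrictEmbedded On

# 30 daemon processes x 25 threads each, per the original post
WSGIDaemonProcess myapp processes=30 threads=25
WSGIProcessGroup myapp
WSGIScriptAlias / /var/www/myapp/app.wsgi
```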
