Problem solved. Can't believe it, but we didn't have a hard timeout on one of the backend calls, so it was hanging around forever.
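For anyone hitting the same thing: the change was just giving the brokered backend call a hard timeout, so a stalled upstream can't pin a mod_wsgi worker thread forever. A rough sketch of that kind of fix, assuming the call is made with the `requests` library (the URL, payload shape, and timeout values below are placeholders, not our actual code):

```python
import requests

def query_backend(payload):
    # Hypothetical backend call; timeout is (connect, read) in seconds.
    # Without the timeout argument, a hung backend holds the worker
    # thread open indefinitely, which is what backlogged our daemon
    # processes.
    resp = requests.post('https://backend.example.com/query',
                         json=payload,
                         timeout=(3.05, 10))
    resp.raise_for_status()
    return resp.json()
```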
On Friday, March 18, 2016 at 2:53:22 PM UTC-7, [email protected] wrote:
>
> I placed the monitors inside of the WSGI script and I am not seeing any stack traces. However, when we begin to see timeouts, I can see that Apache begins respawning child processes (based on the "Starting stack trace monitor" messages in the logs). Looking at dmesg, I can see that Apache hits an out-of-memory condition and then kills a process, but I haven't timed those to see if they align with the stack trace messages. What makes this all confusing is that during the flood of workers and timeouts of our scripts, I can still run individual queries and get back responses.
>
> On Wednesday, March 16, 2016 at 9:41:23 AM UTC-7, [email protected] wrote:
>>
>> Correct, DataDog. I can add that code into our WSGI scripts in dev and see how it works. Will report back.
>>
>> On Tuesday, March 15, 2016 at 7:38:59 PM UTC-7, Graham Dumpleton wrote:
>>>
>>> Okay, it is DataDog. Thought it was, but the first charts found on their web site didn't show the legend.
>>>
>>> On 16 Mar 2016, at 1:36 PM, Graham Dumpleton <[email protected]> wrote:
>>>
>>> What is the monitoring system you are using? The UI looks familiar, but I can't remember what system it is from.
>>>
>>> How hard would it be for you to add a bit of Python code to the WSGI script file for your application which starts a background thread that reports some extra metrics on a periodic basis?
>>>
>>> Also, the fact that it appears to be backlogged looks a bit like stuck requests in the Python web application causing an effect in the Apache child worker processes, as shown by your monitoring. The added metric I am thinking of would confirm that.
>>>
>>> A more brute-force way of tracking down whether requests are getting stuck is to add to your WSGI script file the stack trace monitor described at:
>>>
>>> http://modwsgi.readthedocs.org/en/develop/user-guides/debugging-techniques.html#extracting-python-stack-traces
>>>
>>> That way, when backlogging occurs and busy workers increase, you can force logging of what the Python threads in the web application are doing at that point. If threads are stuck, it will tell you where.
>>>
>>> Graham
>>>
>>> On 16 Mar 2016, at 1:21 PM, [email protected] wrote:
>>>
>>> Clarifying the first line: in our testing, our client is issuing 3 requests per second. There could be more, but it should not exceed 6.
>>>
>>> The request handlers are waiting on a web request that is spawned to another server, which then queries the database. The CPU load is so low it barely crosses 3%, and that's at a high peak. We are typically below 1%.
>>>
>>> The request payload is small and is merely a simple query, though requests can vary in size, roughly 3KB to 100KB.
>>>
>>> Attached is a screenshot of our logging that captures busy/idle/queries on a timeline. Where the yellow line goes to zero and the workers start to increase is where we begin to see timeouts. The eventual dip after the peak is me bouncing the Apache daemon to get it back under some control.
>>>
>>> On Tuesday, March 15, 2016 at 6:35:13 PM UTC-7, Graham Dumpleton wrote:
>>>>
>>>> On 16 Mar 2016, at 12:10 PM, [email protected] wrote:
>>>>
>>>> I am hoping to gain some clarity here on our WSGI configuration, since a lot of the tuning seems to be heavily reliant on the application itself.
>>>>
>>>> Our setup:
>>>>
>>>> - Single load balancer (round robin)
>>>> - Two virtual servers with 16GB of RAM
>>>> - Python app, ~100MB in memory per process
>>>> - Response times are longer as we broker calls, so they can be up to 1-2 seconds
>>>> - Running mod_wsgi 4.4.2 on Ubuntu 14.04 LTS with Apache 2
>>>> - mod_wsgi daemon mode (30 processes with 25 threads)
>>>> - KeepAlive is off
>>>> - WSGIRestrictEmbedded is on
>>>> - Using MPM event
>>>>
>>>> For Apache, we have the following:
>>>>
>>>> - StartServers 30
>>>> - MinSpareThreads 40
>>>> - MaxSpareThreads 150
>>>> - ThreadsPerChild 25
>>>> - MaxRequestWorkers 600
>>>>
>>>> I have tried a number of different scenarios, but all of them generally lead to the same problem. We are processing about 3 requests a second with a steady number of worker threads and plenty of idle capacity. After a few minutes of sustained traffic, we eventually start timing out, which then leads to worker counts climbing until they reach MaxRequestWorkers. Despite this, I am still able to issue requests and get responses, but it ultimately leads to Apache becoming unresponsive.
>>>>
>>>> Just to confirm. You say that you never go above 3 requests per second, but that at worst case those requests can take 2 seconds. Correct?
>>>>
>>>> Are the request handlers predominantly waiting on backend database calls, or are they doing more CPU-intensive work? What is the CPU load on the mod_wsgi daemon processes?
>>>>
>>>> Also, what is the size of the payloads, for requests and responses?
>>>>
>>>> Graham
>>>
>>> <Screen Shot 2016-03-15 at 7.19.45 PM.png>
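For completeness: the stack trace monitor Graham links to above (the mod_wsgi debugging-techniques guide) is what we had running when we finally caught the stuck threads. A stripped-down sketch of the same idea, not the exact recipe from the guide, that can be dropped into the WSGI script file; the 60-second interval and stderr logging here are assumptions to adapt:

```python
import sys
import threading
import time
import traceback

def _dump_stacks(interval=60):
    # Periodically log the stack of every Python thread in this daemon
    # process, so requests stuck on a backend call show where they are
    # blocked.
    while True:
        time.sleep(interval)
        for thread_id, frame in sys._current_frames().items():
            stack = ''.join(traceback.format_stack(frame))
            sys.stderr.write('Thread %d:\n%s\n' % (thread_id, stack))
        sys.stderr.flush()

# Started from the WSGI script so each daemon process runs one monitor.
_monitor = threading.Thread(target=_dump_stacks)
_monitor.daemon = True
_monitor.start()
```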
