Problem solved. Can't believe it, but we didn't have a hard timeout on one of 
the backend calls, so requests were hanging around forever. 
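
For anyone hitting the same thing: the fix boiled down to passing an explicit 
timeout on the outbound backend call. A minimal sketch, assuming the call is 
made with the requests library (the endpoint name here is hypothetical, not 
our real one):

    import requests

    BACKEND_URL = 'http://backend.internal/query'  # hypothetical endpoint

    def query_backend(payload):
        # Without an explicit timeout, the call waits on the backend
        # forever, tying up a mod_wsgi daemon thread for as long as it
        # hangs. (3s connect, 10s read -- tune to your own latencies.)
        response = requests.post(BACKEND_URL, json=payload, timeout=(3, 10))
        response.raise_for_status()
        return response.json()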

On Friday, March 18, 2016 at 2:53:22 PM UTC-7, [email protected] wrote:
>
> I placed the monitors inside the WSGI script and I am not seeing any 
> stacktraces. However, when we begin to see timeouts, I can see that Apache 
> starts respawning child processes (based on the "Starting stack trace 
> monitor" lines in the logs). Looking at dmesg, I can see that the system 
> hits an out-of-memory condition and the kernel kills an Apache process, 
> but I haven't correlated the timestamps to see if those align with the 
> stacktrace messages. What makes this all confusing is that during the 
> flood of workers and the timeouts of our scripts, I can still run 
> individual queries and get back responses. 
>
> On Wednesday, March 16, 2016 at 9:41:23 AM UTC-7, [email protected] 
> wrote:
>>
>> Correct, DataDog. I can add that code into our WSGI scripts in dev and 
>> see how it works. Will report back.
>>
>> On Tuesday, March 15, 2016 at 7:38:59 PM UTC-7, Graham Dumpleton wrote:
>>>
>>> Okay, it is DataDog. I thought it was, but the first charts I found on 
>>> their web site didn't show the legend.
>>>
>>> On 16 Mar 2016, at 1:36 PM, Graham Dumpleton <[email protected]> 
>>> wrote:
>>>
>>> What is the monitoring system you are using? The UI looks familiar, but 
>>> I can't remember which system it is from.
>>>
>>> How hard would it be for you to add a bit of Python code to the WSGI 
>>> script file for your application which starts a background thread that 
>>> reports some extra metrics on a periodic basis?
>>>
>>> Also, the backlog looks a bit like stuck requests in the Python web 
>>> application causing a knock-on effect in the Apache child worker 
>>> processes, as shown by your monitoring. The added metric I am thinking 
>>> of would confirm that.
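>>>
>>> A rough sketch of what I mean, purely illustrative (the middleware and 
>>> names here are made up for the example):
>>>
>>>     import sys
>>>     import threading
>>>     import time
>>>
>>>     _lock = threading.Lock()
>>>     _active_requests = 0
>>>
>>>     def _report_metrics(interval=10.0):
>>>         # Periodically write the number of in-flight requests in this
>>>         # process to the Apache error log (stderr). If the number keeps
>>>         # climbing and never falls, requests are getting stuck.
>>>         while True:
>>>             time.sleep(interval)
>>>             with _lock:
>>>                 count = _active_requests
>>>             sys.stderr.write('in-flight requests: %d\n' % count)
>>>
>>>     _reporter = threading.Thread(target=_report_metrics)
>>>     _reporter.daemon = True
>>>     _reporter.start()
>>>
>>>     def _count_requests(app):
>>>         # WSGI middleware counting requests entering/leaving the app.
>>>         # (Ignores streaming of the response body, for simplicity.)
>>>         def wrapper(environ, start_response):
>>>             global _active_requests
>>>             with _lock:
>>>                 _active_requests += 1
>>>             try:
>>>                 return app(environ, start_response)
>>>             finally:
>>>                 with _lock:
>>>                     _active_requests -= 1
>>>         return wrapper
>>>
>>>     # 'application' is the entry point already in your WSGI script.
>>>     application = _count_requests(application)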
>>>
>>> A more brute-force way of tracking down whether requests are getting 
>>> stuck is to add the code described here to your WSGI script file:
>>>
>>> http://modwsgi.readthedocs.org/en/develop/user-guides/debugging-techniques.html#extracting-python-stack-traces
>>>
>>> That way, when backlogging occurs and the busy worker count increases, 
>>> you can force logging of what the Python threads in the web application 
>>> are doing at that point. If threads are stuck, it will tell you where. 
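>>>
>>> The guide has the complete recipe (it installs a monitor thread); the 
>>> heart of it is dumping the stack of every thread via 
>>> sys._current_frames(). A condensed sketch of just that core idea:
>>>
>>>     import sys
>>>     import traceback
>>>
>>>     def dump_stack_traces(log):
>>>         # Write the current call stack of every Python thread in
>>>         # this process to the given file-like object.
>>>         for thread_id, frame in sys._current_frames().items():
>>>             log.write('Thread %r:\n' % thread_id)
>>>             log.write(''.join(traceback.format_stack(frame)))
>>>             log.write('\n')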
>>>
>>> Graham
>>>
>>> On 16 Mar 2016, at 1:21 PM, [email protected] wrote:
>>>
>>> Clarifying on the first line: in our testing, our client is issuing 3 
>>> requests per second. There could be more, but it should not exceed 6.
>>>
>>> The request handlers are waiting on a web request that is dispatched to 
>>> another server, which then queries the database. The CPU load is so low 
>>> it barely crosses 3%, and that's at a high peak. We are typically below 
>>> 1%.
>>>
>>> The request payload is small and is merely a simple query, though 
>>> requests can vary in size, ranging from roughly 3KB to 100KB.  
>>>
>>> Attached is a screenshot of our logging that captures busy/idle/queries 
>>> on a timeline. Where the yellow line goes to zero and the workers start 
>>> to increase is where we begin to see timeouts. The eventual dip after 
>>> the peak is me bouncing the Apache daemon in order to get it back under 
>>> some control.
>>>
>>> On Tuesday, March 15, 2016 at 6:35:13 PM UTC-7, Graham Dumpleton wrote:
>>>>
>>>>
>>>> On 16 Mar 2016, at 12:10 PM, [email protected] wrote:
>>>>
>>>> I am hoping to gain some clarity here on our mod_wsgi configuration, 
>>>> since a lot of the tuning seems to be heavily reliant on the 
>>>> application itself. 
>>>>
>>>> Our setup
>>>>
>>>>    - Single load balancer (round robin)
>>>>    - Two virtual servers with 16GB of RAM
>>>>    - Python app ~100MB in memory per process
>>>>    - Response times are longer because we broker calls, so they can 
>>>>    be up to 1-2 seconds
>>>>    - Running mod_wsgi 4.4.2 on Ubuntu 14.04 LTS with Apache 2
>>>>    - mod_wsgi daemon mode (30 processes with 25 threads each)
>>>>    - KeepAlive is off
>>>>    - WSGIRestrictEmbedded is On
>>>>    - Using MPM event
>>>>
>>>> For Apache, we have the following:
>>>>
>>>>    - StartServers 30
>>>>    - MinSpareThreads 40
>>>>    - MaxSpareThreads 150
>>>>    - ThreadsPerChild 25
>>>>    - MaxRequestWorkers 600
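>>>>
>>>> For reference, here is roughly how that setup maps onto the actual 
>>>> configuration directives (the daemon process group name and script 
>>>> path are placeholders, not our real values):
>>>>
>>>>     # Apache MPM event settings
>>>>     StartServers            30
>>>>     MinSpareThreads         40
>>>>     MaxSpareThreads        150
>>>>     ThreadsPerChild         25
>>>>     MaxRequestWorkers      600
>>>>     KeepAlive Off
>>>>
>>>>     # mod_wsgi daemon mode: 30 processes x 25 threads each
>>>>     WSGIRestrictEmbedded On
>>>>     WSGIDaemonProcess myapp processes=30 threads=25
>>>>     WSGIScriptAlias / /path/to/app.wsgi \
>>>>         process-group=myapp application-group=%{GLOBAL}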
>>>>
>>>> I have tried a number of different scenarios, but all of them generally 
>>>> lead to the same problem. We are processing about 3 requests a second 
>>>> with a steady number of worker threads and plenty of idle workers. 
>>>> After a few minutes of sustained traffic, we eventually start timing 
>>>> out, which then drives the worker count up until it reaches 
>>>> MaxRequestWorkers. Despite this, I am still able to issue requests and 
>>>> get responses, but it ultimately leads to Apache becoming unresponsive. 
>>>>
>>>>
>>>> Just to confirm: you say that you never go above 3 requests per second, 
>>>> but that in the worst case those requests can take 2 seconds. Correct?
>>>>
>>>> Are the request handlers predominantly waiting on backend database 
>>>> calls, or are they doing more CPU intensive work? What is the CPU load on 
>>>> the mod_wsgi daemon processes?
>>>>
>>>> Also, what is the size of payloads, for requests and responses?
>>>>
>>>> Graham
>>>>
>>>>
>>> <Screen Shot 2016-03-15 at 7.19.45 PM.png>
