On 20 November 2011 14:08, Rodrigo Campos <[email protected]> wrote:
> Hi,
>
> I have an application that is freezing at some random times. It
> doesn't seem to be obviously related with anything. And restarting
> apache seems to fix it.
>
> My first guess was to lower the "maximum-requests" used in WSGI from
> 500 to 200 to avoid any leak that we might be hitting (open fds, mem,
> or something like that), change the LogLevel from warning to info in
> apache config and add "inactivity-timeout=600". Just in case, the WSGI
> relevant parts in the apache vhost looks like this now:
>
>  WSGIDaemonProcess <name> processes=4 maximum-requests=200 threads=1
> inactivity-timeout=600
>  WSGIProcessGroup <some_user>
>  WSGIScriptAlias / /<path>/apache.wsgi
>  WSGIApplicationGroup %{GLOBAL}
>
>
> But with this same config, it happened again today. On the error logs
> during that time there several errors like this one:
>
>  [Sat Nov 19 02:59:28 2011] [error] [client 127.0.0.1] Script timed
> out before returning headers: apache.wsgi
>
> In fact, there are 47 of them. And then 3 like this:
>
>  [Sat Nov 19 08:02:25 2011] [info] mod_wsgi (pid=11550): Daemon
> process inactivity timer expired, stopping process '<name>'.
>
> And 1 like:
>
>  [Sat Nov 19 08:03:06 2011] [info] mod_wsgi (pid=18334): Maximum
> requests reached '<name>'.
>  [Sat Nov 19 08:03:06 2011] [info] mod_wsgi (pid=18334): Shutdown
> requested '<name>'.
>  [Sat Nov 19 08:03:06 2011] [info] mod_wsgi (pid=18334): Stopping
> process '<name>'.
>  [Sat Nov 19 08:03:06 2011] [info] mod_wsgi (pid=18334): Destroying
> interpreters.
>  [Sat Nov 19 08:03:06 2011] [info] mod_wsgi (pid=18334): Cleanup
> interpreter ''.
>  [Sat Nov 19 08:03:07 2011] [info] mod_wsgi (pid=18334): Terminating Python.
>  [Sat Nov 19 08:03:07 2011] [info] mod_wsgi (pid=18334): Python has shutdown.
>
>
> After this it all seems to be normal and working again. So it seems
> the inactivity time out parameter worked as expected, it wasn't
> necessary to restart apache this time.
>
> Also, it seems there is something wrong with the date/hour I didn't
> yet look. It didn't happen so much time between the "Script timed out"
> and the killing. It seems some messages are logged using one time and
> some others using other timezone/something like that. Because I have
> this in the logs, for example:

The timezone difference will be because one log message comes from
Apache child process and the other from the mod_wsgi daemon process.
If Django is overriding timezone then what is displayed will be
different to what Apache child processes show.

> [Sat Nov 19 08:04:31 2011] [info] mod_wsgi (pid=13853): Aborting
> process '<name>'.
> [Sat Nov 19 03:04:31 2011] [error] [client 127.0.0.1] Premature end of
> script headers: apache.wsgi
> [Sat Nov 19 08:04:31 2011] [info] mod_wsgi (pid=18356): Python has shutdown.
> [Sat Nov 19 03:04:32 2011] [info] mod_wsgi (pid=18404): Attach interpreter ''.
>
>
> But the time is actually bad configured on that server (is not on our
> timezone and I will fix it as soon I can confirm the application
> wouldn't have any problems with that), so perhaps it is reported in
> some place as one timezone and some other in some other config file.
>
> This application uses a Mysql (percona, a fork of mysql actually)
> hosted on some other machine. Looking the munin and new-relic stats I
> don't see anything weird. Although in muning, with the mysql plugin, I
> see a huge increase for some minutes of "binlog cache usage". That's
> the only "anomaly" I see.
>
> Anyone has any idea/tip on how can I further debug this ?
>
> I've seen there are commits in WSGI (not included in any release yet)
> that add a blocked timeout and blocker request parameters

Yep. The inactivity-timeout has been serving double duty. It was
really intended for when the process is completely idle with no
requests arriving or being handled. In place of a better solution at
the time, also made it so that it would detect case where no requests
arriving but there were existing requests and for the total of those
active requests there was no input or output. So, worked as a poor
failsafe for when the process as a whole hung because of all requests
blocking for some reason.

In mod_wsgi 4.0 those concepts are properly separated. The
inactivity-timeout is only for case where process is completely idle,
no active requests at all.

The new options for the busy case are blocked-requests and
blocked-timeout. By default the blocked restart is not enabled. It can
be enabled by setting blocked-timeout in much the same was as
inactivity-timeout option. When blocked-requests is not set, it
defaults to what ever the number of threads is for the mod_wsgi daemon
process.

What happens is that when the blocked-timeout is defined, when the
number of concurrent requests is equal to or exceeds blocked-requests,
if there is no input or output for any requests for period set by
blocked-timeout, then the process will be restarted.

So, if blocked-requests isn't set and defaults to total threads, then
when whole process effectively hangs, then it will restart.

If blocked-requests is set lower, for example, to half total threads,
then behaviour is a bit different. In this case the check for input
and output starts at a lower number of potentially blocked requests.
So, the checks for no activity against blocked timeout period will
happen, but not all threads will be in blocked state at this point and
so new requests could still be accepted and handled. If it so happened
that you had a lull period where no requests arrived and so timeout
expired with the requests still blocked, then the process will be
restarted.

Setting blocked-requests to a lower value than total number of threads
therefore allows you to potentially have some head room where can
still handle requests but will try and do a restart if can in an idle
period. So, a form of graceful restart of sorts. In practice though,
if you are under a constant request load, it is unlikely this will
occur.

I need to think about how this all works some more and see whether I
can better determine if individual requests are producing no input and
output and so likely blocked. I may yet be able to come up with a
slightly better way of handling this and cause a restart sooner as
soon as no other active requests in addition to the blocked ones. That
way will not have situation where in busy site will still end up
reaching maximum number of threads being blocked before forced
restart.

> and an other
> one to get some information of the processes/threads (which
> information ?

This is some experimental stuff that works in conjunction with New
Relic agent to track number of concurrent requests and thread
utilisation. It can be graphed in New Relic using custom views. I am
not entirely happy with interface into mod_wsgi for this at this point
and so will be changing how this works.

> a call-trace perhaps ? :-)).

I have just committed some changes that adds a feature where by if a
restart is performed due to blocked requests then it will dump out
what any active requests in the process are doing. You would therefore
see in the Apache logs:

[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Daemon process
busy inactivity timer expired, stopping process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Shutdown
requested 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Dumping stack
trace for active Python threads.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Thread
4320296960 executing file
"/Library/WebServer/Sites/hello-1/htdocs/environ.wsgi", line 8, in
function
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): called from
file "/Library/WebServer/Sites/hello-1/htdocs/environ.wsgi", line 12,
in application.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Thread
4320833536 executing file
"/Library/WebServer/Sites/hello-1/htdocs/environ.wsgi", line 8, in
function
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): called from
file "/Library/WebServer/Sites/hello-1/htdocs/environ.wsgi", line 12,
in application.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Thread
4321370112 executing file
"/Library/WebServer/Sites/hello-1/htdocs/environ.wsgi", line 8, in
function
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): called from
file "/Library/WebServer/Sites/hello-1/htdocs/environ.wsgi", line 12,
in application.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Thread
4313858048 executing file
"/Library/WebServer/Sites/hello-1/htdocs/environ.wsgi", line 8, in
function
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): called from
file "/Library/WebServer/Sites/hello-1/htdocs/environ.wsgi", line 12,
in application.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
10 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
9 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
5 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
8 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
7 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
6 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
4 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
3 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
2 in daemon process 'hello-1'.
[Mon Nov 21 15:18:52 2011] [info] mod_wsgi (pid=12256): Exiting thread
1 in daemon process 'hello-1'.
[Mon Nov 21 15:18:57 2011] [info] mod_wsgi (pid=12256): Aborting
process 'hello-1'.
[Mon Nov 21 15:18:57 2011] [error] [client 127.0.0.1] Premature end of
script headers: environ.wsgi
[Mon Nov 21 15:18:57 2011] [error] [client 127.0.0.1] Premature end of
script headers: environ.wsgi
[Mon Nov 21 15:18:57 2011] [error] [client 127.0.0.1] Premature end of
script headers: environ.wsgi
[Mon Nov 21 15:18:57 2011] [error] [client 127.0.0.1] Premature end of
script headers: environ.wsgi

It does require LogLevel in Apache to be set to 'info' though. There
is only file name, line number and function name context. It isn't
really feasible to give code line like normal Python tracebacks as
this is reaching from underneath all interpreters and written in C
code. It is happening on shutdown and also has to avoid doing stuff
that might result in GIL being released and other code being run. So,
has to be simple as possible. Should be sufficient though in helping
to target trouble spot.

> If I understand correctly
> this could be *very* helpful to debug this problem. Also, if we could
> have a trace where the blocked process where hanging it will be *very
> very* useful :-D
>
> Is there some other configuration option that could help ? Do you know
> when will be a new release that includes the "blocked request"
> parameters ?

Above is all I can think of right now in rush before have to go back
home to no proper internet.

Graham

> Oh, btw, I'm using libapache2-mod-wsg 3.3-2ubuntu2
>
>
> Also, if you need more information, please let me know.
>
>
>
>
>
> Thanks a lot,
> Rodrigo
>
> --
> You received this message because you are subscribed to the Google Groups 
> "modwsgi" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/modwsgi?hl=en.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"modwsgi" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/modwsgi?hl=en.

Reply via email to