Re: [modwsgi] True graceful restarts for mod_wsgi daemon mode

Tomi Belan Wed, 28 Sep 2022 04:16:44 -0700

Hi Kent. There are no updates from my side, I'm afraid. Because of the
reasons I mentioned, I think this issue cannot be fixed without drastic
changes to mod_wsgi and/or Apache itself, at least not the way I thought of.


Good point about SSL certificates. You're correct. I also saw this
mentioned in the documentation of mod_md. Perhaps you could reduce the
frequency of graceful restarts to lower the probability of interrupting a
request, but...

I'm exploring migrating my service to Gunicorn, but it has its own
challenges. I already miss mod_wsgi's easy deployment, direct access to
Apache request variables, and great documentation.

On Wed, Sep 28, 2022 at 12:21 PM Kent <[email protected]> wrote:

> Tomi / Graham,
>
> I don't imagine there are any updates or workarounds on this?
>
> More specifically in my case, it is just for picking up new SSL
> certificates. When Apache needs to reload its config to pick up new SSL
> certificates and the wsgi app is running in daemon mode, there is no true
> way to make that graceful, or is there?
> As far as I know, you need to signal the parent apache process with
> SIGUSR1 (like apache2ctl graceful does), which ends up murdering daemon
> wsgi processes ungracefully.
>
> Let me know please if there are any ideas,
> Kent
>
>
> On Tuesday, May 17, 2022 at 11:45:52 AM UTC-4 [email protected] wrote:
>
>> I didn't get far:
>>
>> The main obstacle I found is that Apache uses dlclose() and dlopen() to
>> unload and reload all module .so files during graceful reload. So
>> registering a cleanup function on a long lived pool such as ap_pglobal or
>> any similar trick just won't work. Any function pointers from mod_wsgi.so
>> may become invalid. Normal data can be stored with ap_retained_data_get(),
>> but not function pointers. See also
>> <https://cwiki.apache.org/confluence/display/httpd/ModuleLife>.
>>
>> It is even possible to add or remove LoadModule commands during graceful
>> reload. So we might be dealing with a graceful reload where mod_wsgi should
>> nevertheless shut down immediately. If we were to blindly assume that
>> Apache graceful reload means mod_wsgi is also about to reload, it can lead
>> to dangling child processes. But there is most likely no way to find out
>> during the old mod_wsgi's cleanup, because the new config hasn't been parsed yet.
>>
>> I glanced at mod_fcgid. Unlike its modern replacement mod_proxy_fastcgi,
>> it can spawn FastCGI services directly. I don't know if mod_fcgid handles
>> all this stuff correctly, but I noticed it works by spawning a "mod_fcgid
>> process manager" process, which then spawns all other children as needed. I
>> guess something like that could work. Spawning a separate "mod_wsgi
>> manager" process just once on first init and registering it
>> with apr_pool_note_subprocess(ap_pglobal, ...) might do the trick -- to
>> make sure that it gets cleaned up and avoid all the issues with function
>> pointers or unloading/reloading of mod_wsgi. But I feel it's too big a
>> change, with too many moving pieces and too much that can go wrong.
>>
>> In conclusion I'd say mod_wsgi is at a local maximum. Its handling of
>> graceful reloads is not the best, but it's good enough for most users, and
>> given Apache's design and public API I don't think any easy fix exists.
>>
>> That's probably all from me on this topic. It's a pity I didn't succeed,
>> but I still had fun. So long. :)
>>
>> On Sun, May 8, 2022 at 5:01 AM Tomi Belan <[email protected]> wrote:
>>
>>> I didn't expect such a fast answer! Thank you!
>>>
>>> I'm definitely interested if you have any other thoughts about writing a
>>> custom process manager. Especially any potential issues or edge cases that
>>> must be taken care of.
>>>
>>> I will probably try my hand at it just for fun, but I'm not at all
>>> familiar with Apache and mod_wsgi internals, so it's pretty daunting. It
>>> probably won't go anywhere.
>>>
>>> Looking at the code, MPM modules do have some superpowers, such as
>>> access to struct ap_unixd_mpm_retained_data. Normal modules will have a
>>> harder time distinguishing between a graceful restart and full shutdown.
>>> Maybe by registering one cleanup function on pconf and another one on
>>> ap_pglobal...? Who knows.
>>>
>>> As for my app:
>>> Partitioning by URL is an interesting idea. Sadly it won't work for my
>>> app, because almost every request can write these files, and the URL
>>> doesn't reveal which requests may be slow. Plus we're forced to use prefork
>>> because we need a certain ancient single-sign-on module which is not thread
>>> safe. Plus we probably can't use embedded mode anyway, because the server
>>> runs two wsgi apps with different virtualenvs, and it needs
>>> "WSGIApplicationGroup %{GLOBAL}" for the lxml library. As I understand it,
>>> embedded mode can't do that. Currently they are two daemon process-groups.
>>> If I'm being honest with myself, the most pragmatic solution might be to
>>> switch to Gunicorn. ;) But even if it comes to that, this puzzle still
>>> interests me. It would be neat to find a proper solution, whether I
>>> ultimately use it in production or not.
>>>
>>> On Sunday, May 8, 2022 at 1:31:33 AM UTC+2 Graham Dumpleton wrote:
>>>
>>>> Fixing my bad edit at the end so it makes proper sense:
>>>>
>>>> Reason am pointing at that is that if there is only one URL of your
>>>> application which is writing these files, then you could consider
>>>> delegating just that one URL to be handled under mod_wsgi embedded mode,
>>>> rather than in the daemon mode process with the rest of your application
>>>> code. As long as the request handler for that doesn't drag in too much
>>>> code, and you aren't using prefork MPM, the memory cost in Apache child
>>>> processes may be manageable. By having that one URL be handled in embedded
>>>> mode, the processes it runs in will be handled under the graceful
>>>> restart mode of the main Apache child processes.
>>>>
>>>> On 8 May 2022, at 9:27 am, Graham Dumpleton <[email protected]>
>>>> wrote:
>>>>
>>>> It definitely is an annoying problem. To be honest I don't think I have
>>>> ever really considered writing my own sub process manager instead of using
>>>> the Apache other processes management code. I will need to think about why
>>>> I never considered doing that and how complicated it would be to replicate.
>>>>
>>>> As to an interim solution, have a read of:
>>>>
>>>> http://blog.dscpl.com.au/2014/02/vertically-partitioning-python-web.html
>>>>
>>>> Reason am pointing at that is that if there is only one URL of your
>>>> application which is writing these files, then you could consider
>>>> delegating just that one URL to be handled under mod_wsgi embedded mode,
>>>> rather than in the daemon mode process with the rest of your application
>>>> code and aren't using prefork MPM. As long as the request handler for that
>>>> doesn't drag in too much code, the memory cost in Apache child processes
>>>> may be manageable. By having that one URL be handled in embedded mode, then
>>>> the processes it runs in will be handled under the graceful restart mode of
>>>> the main Apache child processes.
>>>>
>>>> Graham
>>>>
>>>> On 8 May 2022, at 9:16 am, Tomi Belan <[email protected]> wrote:
>>>>
>>>> How much work would it take to have true graceful restarts for the
>>>> mod_wsgi daemon processes?
>>>>
>>>> Current behavior:
>>>> When "apache2ctl graceful" aka "httpd -k graceful" runs, the Apache
>>>> parent process sends a SIGTERM to each mod_wsgi daemon process, waits up to
>>>> 3 seconds (hardcoded maximum), and sends a SIGKILL to any that are still
>>>> alive. After they're all dead, it spawns new wsgi processes. This is
>>>> mentioned in various issues like #383
>>>> <https://github.com/GrahamDumpleton/mod_wsgi/issues/383> and #124
>>>> <https://github.com/GrahamDumpleton/mod_wsgi/issues/124>, and in the
>>>> documentation of WSGIDaemonProcess shutdown-timeout
>>>> <https://modwsgi.readthedocs.io/en/master/configuration-directives/WSGIDaemonProcess.html#:~:text=shutdown%2Dtimeout>
>>>> .
>>>> In contrast, Apache sends SIGUSR1 to its own worker processes, and
>>>> whenever one of them exits, Apache spawns a new one. So there should almost
>>>> always be enough processes ready to serve new connections. (
>>>> https://httpd.apache.org/docs/2.4/stopping.html#graceful)
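As a rough illustration of the terminate-then-kill behavior described above, here is a minimal Python sketch. It is not Apache or APR code: `stop_with_timeout`, `demo`, the stubborn child script, and the 0.2 second timeout are all made up for the example (the real hardcoded value is the 3 seconds mentioned above). It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical child that ignores SIGTERM, standing in for a daemon
# process that is stuck in a slow request.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_timeout(proc, timeout):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught, so no cleanup code runs
        return proc.wait()


def demo(timeout=0.2):
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # wait until the child ignores SIGTERM
    code = stop_with_timeout(proc, timeout)
    proc.stdout.close()
    return code
```

A negative return code from `demo()` means the child was ended by a signal; here the child never reacts to SIGTERM, so it is the SIGKILL that ends it, and any buffered output it held is lost.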
>>>>
>>>> My wishlist for "true" graceful restarts would be:
>>>> 1. Make the shutdown timeout configurable.
>>>> 2. Don't wait until *all* old daemon processes exit. Either spawn 1 new
>>>> process whenever 1 old process exits, or spawn all N new processes
>>>> immediately and let the old processes exit when they want.
>>>> 3. Add another signal between the SIGTERM and SIGKILL which throws a
>>>> Python exception, so that "finally:" blocks have a chance to run.
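To make item 3 concrete, here is a small Python sketch of a signal handler that raises an exception so that `finally:` blocks get to run. It is a standalone-process illustration only and says nothing about how mod_wsgi handles signals internally. The choice of SIGUSR2 and the names `ShutdownRequested`, `run_request`, `slow_work`, and `demo` are assumptions made for the example.

```python
import signal
import time


class ShutdownRequested(Exception):
    """Raised in the main thread when the (assumed) warning signal arrives."""


def _raise_shutdown(signum, frame):
    # Python runs signal handlers in the main thread, so raising here
    # unwinds whatever the main thread is currently doing.
    raise ShutdownRequested(signum)


def slow_work():
    # Simulate the warning signal arriving in the middle of a slow request.
    signal.raise_signal(signal.SIGUSR2)
    time.sleep(1)


def run_request(work, log):
    try:
        work()
    except ShutdownRequested:
        log.append("interrupted")
    finally:
        # Flush and close files here; this runs before the process exits.
        log.append("cleanup")


def demo():
    log = []
    previous = signal.signal(signal.SIGUSR2, _raise_shutdown)
    try:
        run_request(slow_work, log)
    finally:
        signal.signal(signal.SIGUSR2, previous)  # restore the old handler
    return log
```

The point of the extra signal is ordering: SIGTERM asks nicely, this one forces Python-level unwinding, and SIGKILL remains the last resort that skips all cleanup.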
>>>>
>>>> Current code:
>>>> The linked github issues did mention that this behavior is hardcoded
>>>> deep in Apache and there is nothing mod_wsgi can do, but I wanted to see it
>>>> myself.
>>>> Actually, the logic is not anywhere in https://github.com/apache/httpd
>>>> (in particular, it's NOT in server/mpm_unix.c
>>>> <https://github.com/apache/httpd/blob/trunk/server/mpm_unix.c>), but
>>>> in https://github.com/apache/apr. Specifically, the SIGKILL is sent at
>>>> apr/memory/unix/apr_pools.c#L2810
>>>> <https://github.com/apache/apr/blob/39c271bca156adee03ff49f864dcce27ae6f5d73/memory/unix/apr_pools.c#L2810>
>>>> and the 3-second timeout is hardcoded at apr/memory/unix/apr_pools.c#L98
>>>> <https://github.com/apache/apr/blob/39c271bca156adee03ff49f864dcce27ae6f5d73/memory/unix/apr_pools.c#L98>.
>>>> Any subprocess registered with apr_pool_note_subprocess(...,
>>>> APR_KILL_AFTER_TIMEOUT) will use that timeout. mod_wsgi calls that function
>>>> at server/mod_wsgi.c#L10566
>>>> <https://github.com/GrahamDumpleton/mod_wsgi/blob/dabb377a29cba190c6c48659e3f81df685e47aad/src/server/mod_wsgi.c#L10566>
>>>> .
>>>> The pool where the subprocesses are registered is the pconf pool given
>>>> to wsgi_hook_init. I guess they are probably killed when Apache
>>>> calls apr_pool_clear(process->pconf) in reset_process_pconf() in main.c,
>>>> but I haven't verified this.
>>>> The normal worker process logic is implemented in each mpm. E.g.
>>>> prefork replaces dead children with new live children at
>>>> server/mpm/prefork/prefork.c#L1145
>>>> <https://github.com/apache/httpd/blob/6596870481dc1f0e28ac59c52455691fee9c8524/server/mpm/prefork/prefork.c#L1145>,
>>>> I think.
>>>>
>>>> My thoughts: (please correct me if I'm wrong)
>>>> This seems pretty hard. I definitely see why it wasn't done yet. And
>>>> maybe it's not worth the complexity even if it is possible.
>>>> Originally I hoped I could just write an Apache patch to replace the
>>>> hardcoded timeout value with a config file option. But the logic is in a
>>>> library (apr) so I can't read Apache config directly, and there might be
>>>> API/ABI concerns with extending apr_pool_note_subprocess(). And anyway,
>>>> *only* making the timeout configurable wouldn't be enough because the
>>>> server would just wait without any mod_wsgi process accepting new
>>>> connections.
>>>> I think the best chance of success would be to stop using apr_pool_t
>>>> and apr_pool_note_subprocess() for process management in mod_wsgi. After
>>>> all, it's not the only way: Either use fork() etc directly, like the mpm
>>>> modules, or at least, keep apr_pool_t but use our own custom pool rather
>>>> than "pconf" - most likely saved with ap_retained_data_get(). That way
>>>> mod_wsgi would have more control. When it learns the server is gracefully
>>>> restarting, it will spawn new daemon processes immediately with a new
>>>> socket name, and timeout/kill the old processes later in the background.
>>>> When it learns the server is stopping, it will block until the children are
>>>> terminated.
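The two policies in the paragraph above (replace immediately on graceful restart, block on full stop) can be sketched in Python as follows. This is a toy model under heavy assumptions, not mod_wsgi or APR code: the workers are plain sleeping subprocesses, there are no sockets, and `spawn_workers`, `retire`, `graceful_restart`, and `full_stop` are hypothetical names.

```python
import subprocess
import sys
import threading
import time

# Stand-in for a daemon process: sleeps until it is signalled.
WORKER = "import time; time.sleep(60)"


def spawn_workers(count):
    return [subprocess.Popen([sys.executable, "-c", WORKER]) for _ in range(count)]


def retire(workers, grace):
    """SIGTERM every worker, then SIGKILL any still alive after `grace` seconds."""
    for proc in workers:
        proc.terminate()
    deadline = time.monotonic() + grace
    for proc in workers:
        try:
            proc.wait(timeout=max(0.0, deadline - time.monotonic()))
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def graceful_restart(old_workers, grace):
    """Start replacements first, then retire the old workers in the background."""
    new_workers = spawn_workers(len(old_workers))
    reaper = threading.Thread(target=retire, args=(old_workers, grace))
    reaper.start()
    return new_workers, reaper


def full_stop(workers, grace):
    """On server stop, block until every worker is gone."""
    retire(workers, grace)
```

In this model the new workers exist before any old worker is signalled, which is the property the current pconf-based cleanup cannot provide. The hard part in a real module is what the sketch leaves out: knowing which of the two situations applies, and surviving the module being unloaded and reloaded.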
>>>>
>>>> Does this make sense? Are there any glaring issues I've overlooked?
>>>>
>>>> If the strategy sounds sensible, and if I have enough time, I might try
>>>> to code this. Is it something you would be potentially interested in
>>>> merging? (not too much code review burden, maintenance burden, or risk of
>>>> new bugs)
>>>>
>>>> Just for completeness, the backstory of why I want this:
>>>> My Python app writes files to disk. Sadly, some requests take more than
>>>> 3 seconds. If it is killed with SIGKILL, the file buffer data is
>>>> not written, resulting in a corrupted empty/truncated file. A later batch
>>>> process fails when it tries to read every file in the output directory. I
>>>> know there are many workarounds, such as using a temporary file and
>>>> atomically renaming it, but I became curious about the root cause.
>>>> The server gracefully restarts every day because of log rotation, using
>>>> Ubuntu's default logrotate config. After reading #383
>>>> <https://github.com/GrahamDumpleton/mod_wsgi/issues/383> I also looked
>>>> at Apache's rotatelogs
>>>> <https://httpd.apache.org/docs/2.4/programs/rotatelogs.html>, but it
>>>> doesn't support compression, so I'd rather stay with logrotate.
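For reference, the temporary-file-plus-atomic-rename workaround mentioned above could look roughly like this in Python. It is a sketch rather than the code from the app in question; `atomic_write` and the `.tmp-` prefix are names chosen for the example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write `data` (bytes) to `path` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        # Best-effort cleanup if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process is killed with SIGKILL mid-write, the target file is either absent or still has its previous contents, and only a stray `.tmp-*` file is left behind, so the batch reader would need to skip names with that prefix.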
>>>>
>>>> Versions: Apache 2.4.41 with mpm_prefork, mod_wsgi 4.6.8 in daemon
>>>> mode, Python 3.8.10, Ubuntu 20.04. (old but I don't think this matters)
>>>>
>>>> Tomi
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "modwsgi" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/modwsgi/CACUV5oemMwr1YzKe%3D0JrBTma%2BwQcvyaN5Jzc5uz_Kf31mK12ng%40mail.gmail.com
>>>> <https://groups.google.com/d/msgid/modwsgi/CACUV5oemMwr1YzKe%3D0JrBTma%2BwQcvyaN5Jzc5uz_Kf31mK12ng%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>> .
