Hi Kent. I'm afraid there are no updates from my side. For the reasons I mentioned, I don't think this issue can be fixed without drastic changes to mod_wsgi and/or Apache itself, at least not in the way I had in mind.
Good point about SSL certificates; you're correct. I also saw this mentioned in the mod_md documentation. Perhaps you could reduce the frequency of graceful restarts to lower the probability of interrupting a request, but that only makes the problem rarer. I'm exploring migrating my service to Gunicorn, but it has its own challenges. I already miss mod_wsgi's easy deployment, direct access to Apache request variables, and great documentation.

On Wed, Sep 28, 2022 at 12:21 PM Kent wrote:
> Tomi / Graham,
>
> I don't imagine there are any updates or workarounds on this?
>
> More specifically, in my case it is just for picking up new SSL certificates. When Apache needs to reload its config to pick up new SSL certificates and the WSGI app is running in daemon mode, there is no truly graceful way to do that, or is there?
> As far as I know, you need to signal the parent Apache process with SIGUSR1 (like apache2ctl graceful does), which ends up killing the daemon WSGI processes ungracefully.
>
> Please let me know if there are any ideas,
> Kent
>
>
> On Tuesday, May 17, 2022 at 11:45:52 AM UTC-4 [email protected] wrote:
>
>> I didn't get far:
>>
>> The main obstacle I found is that Apache uses dlclose() and dlopen() to unload and reload all module .so files during a graceful reload. So registering a cleanup function on a long-lived pool such as ap_pglobal, or any similar trick, just won't work: any function pointers into mod_wsgi.so may become invalid. Ordinary data can be preserved with ap_retained_data_get(), but function pointers cannot. See also <https://cwiki.apache.org/confluence/display/httpd/ModuleLife>.
>>
>> It is even possible for LoadModule directives to be added or removed across a graceful reload. So we might be dealing with a graceful reload in which mod_wsgi should nevertheless shut down immediately (because its LoadModule line was removed). If we were to blindly assume that an Apache graceful reload means mod_wsgi is also about to be reloaded, we could end up with dangling child processes.
>> But there is most likely no way to find that out during the old mod_wsgi's cleanup, because the new config hasn't been parsed yet.
>>
>> I glanced at mod_fcgid. Unlike its modern replacement mod_proxy_fcgi, it can spawn FastCGI services directly. I don't know whether mod_fcgid handles all of this correctly, but I noticed it works by spawning a "mod_fcgid process manager" process, which then spawns all other children as needed. I guess something like that could work. Spawning a separate "mod_wsgi manager" process just once, on first init, and registering it with apr_pool_note_subprocess(ap_pglobal, ...) might do the trick: it would make sure the manager gets cleaned up, and it would avoid all the issues with function pointers and with unloading/reloading mod_wsgi. But I feel it's too big a change, with too many moving pieces and too much that can go wrong.
>>
>> In conclusion, I'd say mod_wsgi is at a local maximum. Its handling of graceful reloads is not the best, but it's good enough for most users, and given Apache's design and public API I don't think any easy fix exists.
>>
>> That's probably all from me on this topic. It's a pity I didn't succeed, but I still had fun. So long. :)
>>
>> On Sun, May 8, 2022 at 5:01 AM Tomi Belan wrote:
>>
>>> I didn't expect such a fast answer! Thank you!
>>>
>>> I'm definitely interested if you have any other thoughts about writing a custom process manager, especially any potential issues or edge cases that must be taken care of.
>>>
>>> I will probably try my hand at it just for fun, but I'm not at all familiar with Apache and mod_wsgi internals, so it's pretty daunting. It probably won't go anywhere.
>>>
>>> Looking at the code, MPM modules do have some superpowers, such as access to struct ap_unixd_mpm_retained_data. Normal modules will have a harder time distinguishing between a graceful restart and a full shutdown.
>>> Maybe by registering one cleanup function on pconf and another one on ap_pglobal...? Who knows.
>>>
>>> As for my app:
>>> Partitioning by URL is an interesting idea. Sadly, it won't work for my app, because almost every request can write these files, and the URL doesn't reveal which requests may be slow. Also, we're forced to use prefork because we need a certain ancient single-sign-on module which is not thread-safe. And we probably can't use embedded mode anyway, because the server runs two WSGI apps with different virtualenvs, and it needs "WSGIApplicationGroup %{GLOBAL}" for the lxml library. As I understand it, embedded mode can't do that. Currently they run as two daemon process groups.
>>> If I'm being honest with myself, the most pragmatic solution might be to switch to Gunicorn. ;) But even if it comes to that, this puzzle still interests me. It would be neat to find a proper solution, whether or not I ultimately use it in production.
>>>
>>> On Sunday, May 8, 2022 at 1:31:33 AM UTC+2 Graham Dumpleton wrote:
>>>
>>>> Fixing my bad edit at the end so it makes proper sense:
>>>>
>>>> The reason I am pointing at that is that if there is only one URL of your application which is writing these files, then you could consider delegating just that one URL to be handled under mod_wsgi embedded mode, rather than in the daemon mode process with the rest of your application code. As long as the request handler for that doesn't drag in too much code, and you aren't using the prefork MPM, the memory cost in Apache child processes may be manageable. By having that one URL be handled in embedded mode, the processes it runs in will be handled under the graceful restart mode of the main Apache child processes.
>>>>
>>>> On 8 May 2022, at 9:27 am, Graham Dumpleton wrote:
>>>>
>>>> It definitely is an annoying problem.
>>>> To be honest, I don't think I have ever really considered writing my own subprocess manager instead of using Apache's other-process management code. I will need to think about why I never considered doing that, and how complicated it would be to replicate.
>>>>
>>>> As to an interim solution, have a read of:
>>>>
>>>> http://blog.dscpl.com.au/2014/02/vertically-partitioning-python-web.html
>>>>
>>>> The reason I am pointing at that is that if there is only one URL of your application which is writing these files, then you could consider delegating just that one URL to be handled under mod_wsgi embedded mode, rather than in the daemon mode process with the rest of your application code and aren't using prefork MPM. As long as the request handler for that doesn't drag in too much code, the memory cost in Apache child processes may be manageable. By having that one URL be handled in embedded mode, the processes it runs in will be handled under the graceful restart mode of the main Apache child processes.
>>>>
>>>> Graham
>>>>
>>>> On 8 May 2022, at 9:16 am, Tomi Belan wrote:
>>>>
>>>> How much work would it take to have true graceful restarts for the mod_wsgi daemon processes?
>>>>
>>>> Current behavior:
>>>> When "apache2ctl graceful" (a.k.a. "httpd -k graceful") runs, the Apache parent process sends a SIGTERM to each mod_wsgi daemon process, waits up to 3 seconds (a hardcoded maximum), and sends a SIGKILL to any that are still alive. Only after they're all dead does it spawn new WSGI processes. This is mentioned in various issues such as #383 <https://github.com/GrahamDumpleton/mod_wsgi/issues/383> and #124 <https://github.com/GrahamDumpleton/mod_wsgi/issues/124>, and in the documentation of the WSGIDaemonProcess shutdown-timeout option <https://modwsgi.readthedocs.io/en/master/configuration-directives/WSGIDaemonProcess.html#:~:text=shutdown%2Dtimeout>.
>>>>
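The SIGTERM, wait, SIGKILL escalation described above can be modeled in a few lines of Python. This is a minimal sketch of the behavior only, not APR's actual C code in apr_pools.c: the hypothetical `terminate_with_timeout()` sends SIGTERM, polls the child with a non-blocking `waitpid()` until a deadline, and then falls back to SIGKILL. The 3-second default mirrors the hardcoded value Tomi found; it assumes a POSIX system.

```python
import os
import signal
import time


def terminate_with_timeout(pid: int, timeout: float = 3.0, interval: float = 0.05) -> str:
    """Hypothetical model of APR_KILL_AFTER_TIMEOUT-style shutdown.

    Returns "exited" if the child exited within `timeout` after SIGTERM,
    or "killed" if it had to be sent SIGKILL.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # WNOHANG: returns (0, 0) while the child is still running.
        done_pid, _status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return "exited"
        time.sleep(interval)
    # Grace period is over: force-kill, then reap the child so no zombie remains.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return "killed"
```

A child that exits promptly on SIGTERM returns "exited"; a child that is still busy when the deadline passes (like a request writing a file for more than 3 seconds) is killed, and anything it had not flushed yet is lost.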
>>>> In contrast, Apache sends SIGUSR1 to its own worker processes, and whenever one of them exits, Apache spawns a new one. So there should almost always be enough processes ready to serve new connections. (https://httpd.apache.org/docs/2.4/stopping.html#graceful)
>>>>
>>>> My wishlist for "true" graceful restarts would be:
>>>> 1. Make the shutdown timeout configurable.
>>>> 2. Don't wait until *all* old daemon processes exit. Either spawn 1 new process whenever 1 old process exits, or spawn all N new processes immediately and let the old processes exit in their own time.
>>>> 3. Add another signal between the SIGTERM and the SIGKILL that raises a Python exception, so that "finally:" blocks have a chance to run.
>>>>
>>>> Current code:
>>>> The linked GitHub issues did mention that this behavior is hardcoded deep in Apache and there is nothing mod_wsgi can do, but I wanted to see it for myself.
>>>> Actually, the logic is not anywhere in https://github.com/apache/httpd (in particular, it's NOT in server/mpm_unix.c <https://github.com/apache/httpd/blob/trunk/server/mpm_unix.c>), but in https://github.com/apache/apr. Specifically, the SIGKILL is sent at apr/memory/unix/apr_pools.c#L2810 <https://github.com/apache/apr/blob/39c271bca156adee03ff49f864dcce27ae6f5d73/memory/unix/apr_pools.c#L2810> and the 3-second timeout is hardcoded at apr/memory/unix/apr_pools.c#L98 <https://github.com/apache/apr/blob/39c271bca156adee03ff49f864dcce27ae6f5d73/memory/unix/apr_pools.c#L98>. Any subprocess registered with apr_pool_note_subprocess(..., APR_KILL_AFTER_TIMEOUT) will use that timeout. mod_wsgi calls that function at server/mod_wsgi.c#L10566 <https://github.com/GrahamDumpleton/mod_wsgi/blob/dabb377a29cba190c6c48659e3f81df685e47aad/src/server/mod_wsgi.c#L10566>.
>>>> The pool where the subprocesses are registered is the pconf pool given to wsgi_hook_init.
>>>> I guess they are probably killed when Apache calls apr_pool_clear(process->pconf) in reset_process_pconf() in main.c, but I haven't verified this.
>>>> The normal worker process logic is implemented in each MPM. For example, prefork replaces dead children with new live children at server/mpm/prefork/prefork.c#L1145 <https://github.com/apache/httpd/blob/6596870481dc1f0e28ac59c52455691fee9c8524/server/mpm/prefork/prefork.c#L1145>, I think.
>>>>
>>>> My thoughts (please correct me if I'm wrong):
>>>> This seems pretty hard. I definitely see why it hasn't been done yet. And maybe it's not worth the complexity even if it is possible.
>>>> Originally I hoped I could just write an Apache patch to replace the hardcoded timeout value with a config file option. But the logic lives in a library (APR), which can't read the Apache config directly, and there might be API/ABI concerns with extending apr_pool_note_subprocess(). In any case, *only* making the timeout configurable wouldn't be enough, because the server would just wait without any mod_wsgi process accepting new connections.
>>>> I think the best chance of success would be to stop using apr_pool_t and apr_pool_note_subprocess() for process management in mod_wsgi. After all, it's not the only way: either use fork() etc. directly, like the MPM modules do, or at least keep apr_pool_t but use our own custom pool rather than "pconf", most likely saved with ap_retained_data_get(). That way mod_wsgi would have more control. When it learns the server is gracefully restarting, it would spawn new daemon processes immediately with a new socket name, and time out and kill the old processes later in the background. When it learns the server is stopping, it would block until the children are terminated.
>>>>
>>>> Does this make sense? Are there any glaring issues I've overlooked?
>>>>
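The proposed strategy (bring up the new generation first, then retire the old one with a configurable grace period) can be sketched as a toy process manager in Python. Everything here is hypothetical: `Manager`, `spawn()` and `retire()` are illustrative names, not mod_wsgi or APR APIs. A real implementation would live in C inside mod_wsgi, would also need the new-socket-name handover, and would retire the old generation in the background instead of blocking. It assumes a POSIX system with `os.fork()`.

```python
import os
import signal
import time
from typing import Callable, Dict, List


def spawn(worker: Callable[[], None]) -> int:
    """Fork a child process that runs `worker()` and then exits."""
    pid = os.fork()
    if pid == 0:  # in the child
        try:
            worker()
        finally:
            os._exit(0)  # never fall back into the parent's code path
    return pid


def retire(pids: List[int], timeout: float, interval: float = 0.05) -> Dict[int, str]:
    """SIGTERM every old worker at once, give them one shared grace period,
    then SIGKILL and reap whatever is still alive."""
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    outcome: Dict[int, str] = {}
    deadline = time.monotonic() + timeout
    while len(outcome) < len(pids) and time.monotonic() < deadline:
        for pid in pids:
            if pid not in outcome and os.waitpid(pid, os.WNOHANG)[0] == pid:
                outcome[pid] = "exited"
        time.sleep(interval)
    for pid in pids:
        if pid not in outcome:
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
            outcome[pid] = "killed"
    return outcome


class Manager:
    """Hypothetical generation-based manager (wishlist items 1 and 2)."""

    def __init__(self, worker: Callable[[], None], processes: int, shutdown_timeout: float):
        self.worker = worker
        self.processes = processes
        self.shutdown_timeout = shutdown_timeout  # configurable, not hardcoded
        self.current: List[int] = []

    def _spawn_generation(self) -> List[int]:
        return [spawn(self.worker) for _ in range(self.processes)]

    def start(self) -> None:
        self.current = self._spawn_generation()

    def graceful_restart(self) -> Dict[int, str]:
        # Start the new generation first, so there is always a worker...
        old, self.current = self.current, self._spawn_generation()
        # ...then retire the old one. This toy blocks here; a real manager
        # would do it in the background while the new workers serve.
        return retire(old, self.shutdown_timeout)

    def stop(self) -> Dict[int, str]:
        # Full shutdown: block until every worker is gone.
        old, self.current = self.current, []
        return retire(old, self.shutdown_timeout)
```

The design point is the ordering inside `graceful_restart()`: unlike the current APR pool cleanup, the new workers exist before the old ones are signalled, so a slow old worker only delays its own exit, not the availability of the service.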
>>>> If the strategy sounds sensible, and if I have enough time, I might try to code this. Is it something you would potentially be interested in merging (i.e. without too much code review burden, maintenance burden, or risk of new bugs)?
>>>>
>>>> Just for completeness, here is the backstory of why I want this:
>>>> My Python app writes files to disk. Sadly, some requests take more than 3 seconds. If the process is killed with SIGKILL, the buffered file data is never written, resulting in a corrupted empty or truncated file. A later batch process then fails when it tries to read every file in the output directory. I know there are many workarounds, such as writing to a temporary file and atomically renaming it, but I became curious about the root cause.
>>>> The server gracefully restarts every day because of log rotation, using Ubuntu's default logrotate config. After reading #383 <https://github.com/GrahamDumpleton/mod_wsgi/issues/383> I also looked at Apache's rotatelogs <https://httpd.apache.org/docs/2.4/programs/rotatelogs.html>, but it doesn't support compression, so I'd rather stay with logrotate.
>>>>
>>>> Versions: Apache 2.4.41 with mpm_prefork, mod_wsgi 4.6.8 in daemon mode, Python 3.8.10, Ubuntu 20.04. (Old, but I don't think this matters.)
>>>>
>>>> Tomi
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "modwsgi" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/modwsgi/CACUV5oemMwr1YzKe%3D0JrBTma%2BwQcvyaN5Jzc5uz_Kf31mK12ng%40mail.gmail.com <https://groups.google.com/d/msgid/modwsgi/CACUV5oemMwr1YzKe%3D0JrBTma%2BwQcvyaN5Jzc5uz_Kf31mK12ng%40mail.gmail.com?utm_medium=email&utm_source=footer>.
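The workaround Tomi mentions in his original message (write to a temporary file, then atomically rename it) looks roughly like this in Python. It is a minimal sketch, not code from the thread, and `atomic_write()` is a hypothetical helper name. `os.replace()` is an atomic rename on POSIX when source and destination are on the same filesystem, which is why the temporary file is created in the destination directory.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory => same filesystem, so os.replace() is an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk first
        os.replace(tmp_path, path)  # readers see either the old file or the new one
    except BaseException:
        os.unlink(tmp_path)  # remove the partial temp file, then re-raise
        raise
```

If the process is SIGKILLed mid-write, the final path holds either the previous complete file or nothing at all, but a stray `.tmp-` file can be left behind, so the batch job that reads "every file in the output directory" should skip that prefix. Note also that `tempfile.mkstemp()` creates the file with mode 0600, so an `os.chmod()` may be needed if another user reads the output, and full durability across a power loss would additionally require fsyncing the directory.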
