On Wed, 17 Jun 2026 09:25:07 -0400 Aaron Conole <[email protected]> wrote:
> Timothy Redaelli via dev <[email protected]> writes:
>
> > When ovsdb-server or ovs-vswitchd fails and auto-restarts
> > (Restart=on-failure), it briefly passes through the failed/inactive
> > state. This causes a cascade: the umbrella service (which Requires
> > both) sees the failure and stops, which in turn stops the other
> > service via PartOf. When the failed service comes back, the other
> > does not automatically restart.
> >
> > RestartMode=direct (systemd v254+, PR systemd/systemd#27584) makes
> > the service transition directly to the activating state during
> > auto-restart, skipping the failed/inactive state. Dependents never
> > see the failure, so the cascade does not happen.
> >
> > On older systemd versions the directive is silently ignored with a
> > harmless journal warning ("Unknown key name 'RestartMode'"), so
> > this change is safe for all supported platforms. Tested with
> > containers:
> >
> >   systemd 252 (CentOS Stream 9, Debian 12): warning, ignored
> >   systemd 255 (Ubuntu 24.04): recognized, clean
> >   systemd 256 (CentOS Stream 10): recognized, clean
> >   systemd 257 (Debian 13): recognized, clean
>
> I didn't check, but we should probably make sure that any systems where
> we apply this also have:
>
> https://github.com/goenkam/systemd/commit/7f85fc2c31f074badcf4d517a4f84a1fd72cf909
>
> applied, right? Otherwise, I think there's some kind of looped
> dependency restarts when this is triggered.

That commit (upstream 7a13937007, in v257+) fixes stop-job propagation to
BindsTo= dependents during direct-mode restarts. OVS doesn't use BindsTo=:
openvswitch.service uses Requires= on the sub-services, and the
sub-services use PartOf=openvswitch.service.

The cascade we're preventing happens because Requires= reacts to the
sub-service entering the failed/inactive state. RestartMode=direct
prevents that by skipping the state transition entirely, and that code
path has been there since v254.

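For reference, a simplified sketch of the dependency layout described
above. It is illustrative only and not copied from the shipped unit
files: it shows just the directives named in this thread and elides
everything else.

```ini
# openvswitch.service (umbrella unit), simplified
[Unit]
Requires=ovsdb-server.service ovs-vswitchd.service

# ovsdb-server.service (ovs-vswitchd.service is analogous), simplified
[Unit]
PartOf=openvswitch.service

[Service]
Restart=on-failure
# Added by this series; systemd older than v254 logs a warning and ignores it.
RestartMode=direct
```
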
> But actually, this mode should only be on Type=one-shot services I
> think. If ovsdb-server experiences failure, the RestartMode=direct
> shouldn't have any effect. I'm guessing based on this:
>
>   * i.e. unit_process_job -> job_finish_and_invalidate is never called,
>   * and the previous job might still be running (especially for
>   * Type=oneshot services).
>
> Which seems to imply that if there's a weird failure propagated, we
> might end up with too many instances of vswitchd/db-server running.

RestartMode=direct is not restricted to Type=oneshot; it works with any
service type. The comment you quoted says "especially for Type=oneshot
services" because those have long-running ExecStart= commands that might
still be in progress when a restart is attempted.

Our services are Type=forking with PIDFile=. This means the restart only
triggers when the main process exits (that's what Restart=on-failure
reacts to), so by the time service_enter_restart() runs, the old process
is already gone. There's no window where two instances coexist.

Re-reading the systemd service files made me think about migrating from
Type=forking to Type=notify, to avoid the useless forking and PID
checking and to get proper readiness signaling (sd_notify), but I'll do
that as a follow-up series (since RestartMode=direct will still be
needed).

> Perhaps I'm misunderstanding something.
>
> > Timothy Redaelli (2):
> >   rhel: Add RestartMode=direct to service units.
> >   debian: Add RestartMode=direct to service units.
> >
> >  debian/openvswitch-switch.ovs-vswitchd.service      | 1 +
> >  debian/openvswitch-switch.ovsdb-server.service      | 1 +
> >  rhel/usr_lib_systemd_system_ovs-vswitchd.service.in | 1 +
> >  rhel/usr_lib_systemd_system_ovsdb-server.service    | 1 +
> >  4 files changed, 4 insertions(+)
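
To illustrate the Type=forking point above, a minimal sketch of the
relevant [Service] settings. The PIDFile= path is a placeholder chosen
for this example, not the path used in our units.

```ini
[Service]
Type=forking
# Placeholder path, for illustration only.
PIDFile=/run/example/ovsdb-server.pid
Restart=on-failure
RestartMode=direct
```

With Type=forking, systemd uses PIDFile= to identify the main process of
the service, and that is the process whose exit Restart=on-failure
reacts to.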
