Re: [OpenWrt-Devel] Sysupgrade and Failed to kill all processes

Reiner Karlsberg Tue, 12 May 2020 23:08:18 -0700

Applause, applause.

The first (partial) docs of the magic of sysupgrade. And its pitfalls.


Having had various issues with sysupgrade myself in the past (including 
sysupgrade OTA), I would add the following notes:
- Having open files on storage devices (e.g. for swap, but also files opened 
explicitly) broke sysupgrade for me.
- There is no real error feedback in case sysupgrade was _not_ done. It can even 
leave the filesystem in an inconsistent state,
as "sysupgrade ... -f myfilestobesaved.tar.gz" was applied to the (still) running 
image, without upgrading to the new firmware.bin
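
For the swap case, a minimal pre-flight check could look like the sketch below. 
It assumes the kernel lists active swap areas in /proc/swaps (one header line, 
then one line per area). The function names and the optional file argument are 
made up for illustration; none of this is part of sysupgrade itself.

```shell
#!/bin/sh
# List active swap areas, one per line; prints nothing if there are none.
# $1: file in /proc/swaps format (defaults to /proc/swaps; the argument
#     exists only so the function can be tried against a sample file).
active_swaps() {
    # Skip the header line and print the first column (device or file path).
    awk 'NR > 1 { print $1 }' "${1:-/proc/swaps}"
}

# Return 0 if no swap area is active, 1 (with a message on stderr) otherwise.
check_no_swap() {
    swaps=$(active_swaps "$1")
    if [ -n "$swaps" ]; then
        echo "swap still active, run swapoff first:" >&2
        echo "$swaps" >&2
        return 1
    fi
    return 0
}
```

A wrapper script could call "check_no_swap || swapoff -a" before invoking 
sysupgrade. This covers swap only, not other files held open on the device.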

Regarding killing the processes in a loop of 10 iterations: in addition to a 
short sleep in every iteration,
it may also help to check whether any process is still alive.
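
A rough sketch of that idea, not the actual code in /lib/upgrade/stage2: send 
TERM once, then retry KILL up to 10 times with a one-second sleep between 
rounds, and stop as soon as nothing is left. The function names and the 
one-second delay are my own assumptions.

```shell
#!/bin/sh
# Return 0 if at least one of the given PIDs is still alive.
any_alive() {
    for pid in "$@"; do
        # kill -0 sends no signal; it only tests whether the PID can be signalled.
        kill -0 "$pid" 2>/dev/null && return 0
    done
    return 1
}

# Terminate the given PIDs; return 0 once all are gone, 1 if some survive.
kill_with_retries() {
    [ "$#" -gt 0 ] || return 0
    # Ask politely first so daemons get a chance to clean up.
    kill -TERM "$@" 2>/dev/null
    tries=10
    while [ "$tries" -gt 0 ]; do
        # Give the processes time to exit before checking again.
        sleep 1
        any_alive "$@" || return 0
        kill -KILL "$@" 2>/dev/null
        tries=$((tries - 1))
    done
    any_alive "$@" && return 1
    return 0
}
```

A non-zero return tells the caller that some process survived all rounds, so 
it can decide what to do next instead of hanging.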

Having read your mail, I am happy that for some time already I have explicitly 
been killing the processes myself
before sysupgrade, especially when I have non-standard programs running, 
like nginx or squid,
as the default config of squid defines a 10s duration for shutdown.


On 13.05.2020 at 08:17, Michael Jones wrote:

I've been investigating a problem with sysupgrade failing with the error message "Failed to kill all processes", and then hanging indefinitely.
This happens maybe once every 10-20 sysupgrades, and it's kind of a pain.
So far I've determined this workflow that the sysupgrade command follows. Note, I'm not aiming for 100% accuracy, but just broad strokes.
1) /sbin/sysupgrade locates the file to upgrade from on the filesystem, or if the second option to sysupgrade starts with http://, it downloads the firmware file using wget.
2) /sbin/sysupgrade does some minor validation of various things, and grabs whatever config files it thinks the end user wants to be restored and packs them up into some kind of tarball.
3) sysupgrade sends a message, via ubus, to procd, to initiate the upgrade.
4) Procd does some stuff which I haven't finished completely understanding just yet, but it looks like firmware verification to make sure we don't upgrade to a bad firmware file.
5) It *does not* appear that procd will proactively terminate services until everything (or almost everything) is shut down. Seems like something that should be added to increase reliability.
6) procd replaces itself (execvp system call) with the program /sbin/upgraded. This means that procd is *no longer running*, PID 1 is now /sbin/upgraded. So service management is not possible at this point.
7) /sbin/upgraded now acts as PID1. It executes the shell script 
/lib/upgrade/stage2 with parameters.
8) The shell script loops on all processes, and sends them the TERM signal, and then the KILL signal. See email subject for problems with this.
9) the shell script creates a new ram filesystem, mounts it, then copies over a 
very small set of binaries into it.
10) The shell script changes root into the new ram filesystem
11) Inside the ram filesystem, the shell script writes the upgraded firmware and 
saved configuration to disk
12) Reboot.


Now that the very rough summary is out of the way, I have 4 questions.
1) I notice that the shell script /lib/upgrade/stage2 is doing a tight loop with kill -9 to terminate processes. However, it's only looping a maximum of 10 times, and it's going as fast as the shell can loop.
What's to stop this loop from quickly going through every process almost immediately 10 times, before a process that would be about to terminate terminates? The process in question may be handling some kind of IO, so the kernel wouldn't immediately terminate it.
Shouldn't there be some very brief sleep at the end of each loop iteration to ensure that the processes that are going to practically terminate have done so?
2) Why is the behavior on failure to terminate processes to just give up? That leaves devices hanging without any network connectivity.
A reboot with some logging on disk would allow for remote sysupgrades to have 
some kind of recoverability.
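
As a sketch of what such a fallback might look like (this is not existing 
stage2 behaviour): write the reason to a log file, flush it, and reboot. The 
names UPGRADE_LOG, REBOOT_CMD and upgrade_failed are hypothetical, and the 
default log path is only a placeholder that would have to point at storage 
which survives the reboot.

```shell
#!/bin/sh
# Hypothetical settings; both can be overridden before calling upgrade_failed.
UPGRADE_LOG="${UPGRADE_LOG:-/root/sysupgrade-failed.log}"
REBOOT_CMD="${REBOOT_CMD:-reboot}"

# Log why the upgrade was abandoned, then reboot instead of hanging.
# $1: human-readable reason for the failure.
upgrade_failed() {
    {
        echo "sysupgrade aborted: $1"
        # Record which processes were still around, if ps is available.
        ps 2>/dev/null || echo "(process list unavailable)"
    } >> "$UPGRADE_LOG"
    # Flush the log to storage before the system goes down.
    sync
    # Left unquoted on purpose so the command string splits into words.
    $REBOOT_CMD
}
```

The failure branch would then call something like 
upgrade_failed "Failed to kill all processes" rather than giving up.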

3) Is looping over sigkill a reliable way to terminate all processes?
I was under the impression that the only reliable way to ensure all processes terminate is to use cgroups, and put the processes to terminate in the freezer group and then kill them off after they've been frozen. Otherwise you have basically a race condition between the termination of processes and the creation of children. E.g. a fork-bomb could prevent all processes from being terminated.
4) Why doesn't procd, prior to calling execvp on the /sbin/upgraded program, 
shut down all the services that are running?

Maybe I'm just not seeing where it does this, so if that's the case, then I'm 
happy to be corrected.
But I'm under the impression that when not using cgroups, stopping all services would allow for anything that isn't double forked to be gracefully shutdown and cleaned up after itself.
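
A sketch of what "stopping all services" could mean in practice, assuming the 
usual OpenWrt layout where /etc/rc.d holds K* entries pointing at init scripts 
that accept a "stop" action. This is not what procd does today; the function 
name is hypothetical.

```shell
#!/bin/sh
# Run every K* script in a directory with the "stop" action, in glob order.
# $1: directory holding the scripts (defaults to /etc/rc.d).
stop_all_services() {
    rc_dir=${1:-/etc/rc.d}
    for script in "$rc_dir"/K*; do
        # Skip the unexpanded pattern when no K* entries exist.
        [ -x "$script" ] || continue
        # Keep going even if one service refuses to stop.
        "$script" stop || echo "stop failed: $script" >&2
    done
}
```

Anything that survives this pass (double-forked daemons, for example) would 
still need the TERM/KILL loop afterwards.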
_______________________________________________
openwrt-devel mailing list
[email protected]
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


