Applause, applause.
The first (partial) docs of the magic of sysupgrade. And its pitfalls.
Having had various issues with sysupgrade myself in the past (including when
doing sysupgrade OTA), I would add the following notes:
- Having open files on storage devices (e.g. for swap, but also explicitly
opened ones) broke sysupgrade for me.
- No real error feedback in case sysupgrade was _not_ done. It can even leave
the filesystem in an inconsistent state,
as "sysupgrade ... -f myfilestobesaved.tar.gz" was applied to the (still)
running image, without upgrading to the new firmware.bin.
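A minimal pre-flight sketch for the first point, assuming the only thing to
check is active swap. The function names active_swaps and preflight_ok are
made up for illustration; the only real interface used is the /proc/swaps
table, whose first line is a header.

```shell
# List active swap areas from a /proc/swaps-style file (default: /proc/swaps).
active_swaps() {
    # Skip the header line and print the device or file path in column 1.
    awk 'NR > 1 { print $1 }' "${1:-/proc/swaps}"
}

# Hypothetical pre-flight check: fail while any swap area is still active.
preflight_ok() {
    swaps=$(active_swaps "$1")
    if [ -n "$swaps" ]; then
        echo "swap still active: $swaps" >&2
        return 1
    fi
    return 0
}
```

One could run "preflight_ok || swapoff -a" before calling sysupgrade. This
does not cover files that other programs opened explicitly.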
Regarding killing the processes in a 10-times loop: in addition to a short
sleep in every iteration,
it may also be worth checking whether the process is still alive.
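Roughly what I have in mind, as a sketch only and not what /lib/upgrade/stage2
actually does: send TERM once, then poll with a short sleep, and fall back to
KILL only after the grace period. The function name terminate_and_wait and its
default values are my own assumptions.

```shell
# Ask a process to terminate, then poll until it is gone or the retries run out.
# Arguments: PID, number of polls (default 10), seconds between polls (default 1).
# Note: POSIX sh has no "local", so these variables are global.
terminate_and_wait() {
    pid=$1
    tries=${2:-10}
    delay=${3:-1}
    kill -TERM "$pid" 2>/dev/null || true
    while [ "$tries" -gt 0 ]; do
        # kill -0 delivers no signal; it only tests whether the PID can be signalled.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep "$delay"
        tries=$((tries - 1))
    done
    # Last resort once the grace period is over.
    kill -KILL "$pid" 2>/dev/null || true
    sleep "$delay"
    # Succeed only if the process is really gone now.
    ! kill -0 "$pid" 2>/dev/null
}
```

A process stuck in uninterruptible IO can survive even the KILL, in which case
the function returns non-zero and the caller has to decide what to do.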
Having read your mail, I am glad that for some time already I have explicitly
been killing the processes myself
before sysupgrade, especially when I have non-standard programs running,
like nginx or squid,
as the default config of squid defines a 10s duration for shutdown.
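For illustration, stopping such services through their init scripts before
calling sysupgrade could look like the sketch below. stop_services and the
INIT_DIR override are hypothetical names; the only convention relied on is
that a service has a script /etc/init.d/<name> which accepts "stop".

```shell
# Directory holding the init scripts; overridable so this can be tried off-device.
INIT_DIR=${INIT_DIR:-/etc/init.d}

# Stop each named service via its init script, skipping ones that are not installed.
stop_services() {
    for svc in "$@"; do
        if [ -x "$INIT_DIR/$svc" ]; then
            "$INIT_DIR/$svc" stop || echo "warning: $svc did not stop cleanly" >&2
        fi
    done
}
```

Usage would be "stop_services nginx squid" followed by the usual sysupgrade
call. For a slow stopper like squid one still has to wait until the process
has actually exited.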
On 13.05.2020 at 08:17, Michael Jones wrote:
I've been investigating a problem with sysupgrade failing with the error message "Failed to kill all processes", and
then hanging indefinitely.
This happens maybe once every 10-20 sysupgrades, and it's kind of a pain.
So far I've determined this workflow that the sysupgrade command follows. Note, I'm not aiming for 100% accuracy, but
just broad strokes.
1) /sbin/sysupgrade locates the file to upgrade from on the filesystem, or if the second option to sysupgrade starts
with http://, it downloads the firmware file using wget.
2) /sbin/sysupgrade does some minor validation of various things, and grabs whatever config files it thinks the end user
wants to be restored and packs them up into some kind of tarball.
3) sysupgrade sends a message, via ubus, to procd, to initiate the upgrade.
4) Procd does some stuff which I haven't finished completely understanding just yet, but it looks like firmware
verification to make sure we don't upgrade to a bad firmware file.
5) It *does not* appear that procd will proactively terminate services until everything (or almost everything) is shut
down. Seems like something that should be added to increase reliability.
6) procd replaces itself (execvp system call) with the program /sbin/upgraded. This means that procd is *no longer
running*; PID 1 is now /sbin/upgraded. So service management is not possible at this point.
7) /sbin/upgraded now acts as PID1. It executes the shell script
/lib/upgrade/stage2 with parameters.
8) The shell script loops on all processes, and sends them the TERM signal, and then the KILL signal. See email subject
for problems with this.
9) the shell script creates a new ram filesystem, mounts it, then copies over a
very small set of binaries into it.
10) The shell script changes root into the new ram filesystem.
11) Inside the ram filesystem, the shell script writes the upgraded firmware and
saved configuration to disk.
12) Reboot.
Now that the very rough summary is out of the way, I have 4 questions.
1) I notice that the shell script /lib/upgrade/stage2 is doing a tight loop with kill -9 to terminate processes.
However, it's only looping a maximum of 10 times, and it's going as fast as the shell can loop.
What's to stop this loop from quickly going through every process almost immediately 10 times, before a process that
would be about to terminate terminates? The process in question may be handling some kind of IO, so the kernel wouldn't
immediately terminate it.
Shouldn't there be some very brief sleep at the end of each loop iteration to ensure that the processes that are going
to practically terminate have done so?
2) Why is the behavior on failure to terminate processes to just give up? That leaves devices hanging without any
network connectivity.
A reboot with some logging on disk would allow for remote sysupgrades to have
some kind of recoverability.
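As a sketch of what I mean, assuming a writable persistent location still
exists at that point (which I have not verified): record the reason, then
reboot instead of hanging. abort_upgrade, LOG_FILE and REBOOT_CMD are
hypothetical names, not something the existing scripts provide.

```shell
# Hypothetical fallback for a failed upgrade attempt.
# LOG_FILE and REBOOT_CMD are assumed names and defaults, overridable for testing.
LOG_FILE=${LOG_FILE:-/sysupgrade-failed.log}
REBOOT_CMD=${REBOOT_CMD:-reboot -f}

abort_upgrade() {
    # Append a timestamped reason so it can be inspected after the reboot.
    printf '%s sysupgrade aborted: %s\n' "$(date)" "$1" >> "$LOG_FILE"
    # Flush the log to storage before the forced reboot.
    sync
    # Intentionally unquoted so that "reboot -f" splits into command and option.
    $REBOOT_CMD
}
```

The failure branch would then call something like
abort_upgrade "Failed to kill all processes" instead of giving up.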
3) Is looping over sigkill a reliable way to terminate all processes?
I was under the impression that the only reliable way to ensure all processes terminate is to use cgroups, and put the
processes to terminate in the freezer group and then kill them off after they've been frozen. Otherwise you have
basically a race condition between the termination of processes and the creation of children. E.g. a fork-bomb could
prevent all processes from being terminated.
4) Why doesn't procd, prior to calling execvp on the /sbin/upgraded program, shut down all
the services that are running?
Maybe I'm just not seeing where it does this, so if that's the case, then I'm
happy to be corrected.
But I'm under the impression that when not using cgroups, stopping all services would allow anything that isn't
double forked to shut down gracefully and clean up after itself.
_______________________________________________
openwrt-devel mailing list
[email protected]
https://lists.openwrt.org/mailman/listinfo/openwrt-devel