Hi,

> > > > playing with updates, I maneuvered the EBG envs on a system into
> > > > this weird state:
> > > >
> > > > ----------------------------
> > > > Config Partition #0 Values:
> > > > in_progress: yes
> > > > revision: 4
> > > > kernel: C:BOOT1:linux.efi
> > > > kernelargs:
> > > > watchdog timeout: 0 seconds
> > > > ustate: 3 (FAILED)
> > > >
> > > > user variables:
> > > > recovery_status = failed
Hm, did you start with a clean environment and SWUpdate >= 2022.12?

> > > > ----------------------------
> > > > Config Partition #1 Values:
> > > > in_progress: no
> > > > revision: 3
> > > > kernel: C:BOOT1:linux.efi
> > > > kernelargs:
> > > > watchdog timeout: 0 seconds
> > > > ustate: 2 (TESTING)
> > > >
> > > > user variables:

I see - we should *never* reach this state.

> > > > To get there, I started an update with swupdate and booted into
> > > > testing path #1.
> > >
> > > Ok
> > >
> > > > But then I didn't confirm this update and rather started it
> > > > again, using the same swu.
> > >
> > > It looks to me that this is the point. SWUpdate requires the
> > > transaction to be closed, either by itself or by the deployment
> > > server (hawkBit). If a system boots with TESTING, the glue logic
> > > should start SWUpdate asking it to close the transaction - with OK
> > > or FAILED - by passing the -c parameter.
> > >
> > > However, this was designed to work together with the deployment
> > > server, because it handles the state machine on hawkBit. The
> > > parameter is ignored if another deployment interface (web server,
> > > USB, ...) is used.

The suricatta modules handle this for you - as a "convenience" feature
and to keep the (hawkBit, ...) server's view of things consistent with
the device's, which is more important than the convenience aspect :)
If you're running it with other modules/modes, you're on your own.
Then, you have to play along the (convention) rules to close the
transaction, as there's nothing preventing you from getting into this
situation with EFI Boot Guard. Hence the valid question: should this be
allowed / denied by EFI Boot Guard itself, or by the tools (SWUpdate in
this case) making use of it?

> > > This is managed (again, in such a situation) in the glue logic,
> > > and the transaction (that is, the setting of ustate) is closed
> > > before starting SWUpdate.
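The boot-time glue-logic step described above could be sketched roughly
as follows. This is a hypothetical illustration, not the shipped glue
logic: only the TESTING (2) and FAILED (3) ustate values appear in this
thread, and the exact swupdate/bg_printenv invocations may differ in
your integration.

```shell
#!/bin/sh
# Hypothetical boot-time glue logic: before anything else starts
# SWUpdate, inspect ustate and close a still-open transaction.

# Map a numeric ustate to the action the glue logic should take.
decide_action() {
    case "$1" in
        2) echo "close-ok"     ;;  # TESTING: we booted fine -> confirm
        3) echo "close-failed" ;;  # FAILED: report the failure
        *) echo "none"         ;;  # nothing pending
    esac
}

# In a real integration this would come from the firmware environment,
# e.g.: USTATE=$(bg_printenv -c | sed -n 's/^ustate: *\([0-9]\).*/\1/p')
USTATE=2

case "$(decide_action "$USTATE")" in
    close-ok)     echo "close transaction with OK (swupdate -c ...)"     ;;
    close-failed) echo "close transaction with FAILED (swupdate -c ...)" ;;
    none)         echo "no transaction to close, start SWUpdate normally" ;;
esac
```

Only once this has run (and ustate is back to a closed state) would the
normal SWUpdate invocation be allowed to proceed.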
> > > Or, in the case of U-Boot, it is also managed with the help of
> > > additional (and custom) variables.
> > >
> > > In your case, it seems that nothing is done at boot time, and
> > > SWUpdate is started. SWUpdate does not know that new software is
> > > running (because it expects that someone has already decided, and
> > > ustate is not checked), and the same SWU is loaded again.

Exactly, here you're on your own. You have to instrument EFI Boot Guard
so that it's happy... which is convention and not enforced, currently.
Granted, this requires a lot of context knowledge on how to integrate
things properly and seamlessly...

> > > > That didn't complete because the UUID clash was detected.
> > > > swupdate terminated, and I was left with the above.
> > > >
> > > > I can still boot this constellation, EBG will select path #1
> > > > (endless testing, so to say). OTOH:
> > > >
> > > > # bg_printenv -c
> > > > Using latest config partition
> > > > Values:
> > > > in_progress: yes
> > > > revision: 4
> > > > kernel: C:BOOT1:linux.efi
> > > > kernelargs:
> > > > watchdog timeout: 0 seconds
> > > > ustate: 3 (FAILED)
> > > >
> > > > user variables:
> > > > recovery_status = failed
> > > >
> > > > That is not quite correct. To be fair, bg_printenv deals with an
> > > > illegal state here.
> > >
> > > Agree.
> > >
> > > > Still...
> > > >
> > > > The key question is: where is it best to avoid entering this
> > > > state in the first place?
> > >
> > > My question is why the transaction was not closed before running
> > > SWUpdate. This is a common pattern even with other bootloaders, but
> > > it is more important here because EBG stores a history (well, with
> > > depth=1) of the previous run.
> > >
> > > SWUpdate can check the state while it is running, but there is no
> > > general case.
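To make the "illegal state" above concrete: in_progress marks a
transaction that is still open, so an open transaction whose ustate
already says FAILED should never occur. A minimal sketch of such a
consistency check, run against the very dump reported in this thread
(the parsing is illustrative only):

```shell
#!/bin/sh
# Sanity-check a bg_printenv dump for the inconsistent combination
# reported in this thread: in_progress: yes plus ustate: 3 (FAILED).

# The dump from the affected config partition, as posted above.
DUMP='in_progress: yes
revision: 4
kernel: C:BOOT1:linux.efi
ustate: 3 (FAILED)'

IN_PROGRESS=$(printf '%s\n' "$DUMP" | sed -n 's/^in_progress: *//p')
USTATE=$(printf '%s\n' "$DUMP" | sed -n 's/^ustate: *\([0-9]\).*/\1/p')

# An open transaction must not carry a final FAILED verdict.
if [ "$IN_PROGRESS" = "yes" ] && [ "$USTATE" = "3" ]; then
    echo "illegal state: in_progress with ustate FAILED"
else
    echo "state looks consistent"
fi
```

Whether such a check belongs in EFI Boot Guard, in SWUpdate, or in the
glue logic is exactly the open question of this thread.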
> > > There are use cases where the OK is coming from the application,
> > > and SWUpdate waits via IPC for the result (but then SWUpdate is
> > > started with the WAIT option, and does not try to load a new SWU).
> > > So SWUpdate cannot decide by itself that TESTING is a wrong ustate,
> > > because that depends on the single project.

One common pattern is to have a "health" target, and once that's
reached you start SWUpdate with the according parameters (or set them
yourself via some gluing method). But again, that is convention, not
enforced, and it's currently the responsibility of the system
integrator to get it right.

> > I was running swupdate manually from the command line. No backend
> > involved, just the desire to intentionally break things. ;)
>
> The best way to reach the goal... :-D

If you had used suricatta, you would have missed this :)

> And yes, this can happen because the part deciding whether the
> previous update was ok is missing. In most projects, if the system is
> up and running, it is considered ok. That means the decision is done
> in SWUpdate's systemd unit (or SysV init script), see also the glue
> logic under /usr/lib/swupdate. In some other cases, the update is ok
> only if the application is running, a migration of a custom database
> succeeded, and so on... that means it is outside SWUpdate. SWUpdate
> supports all these use cases.

Yes, that's the codified context knowledge. Still, if you miss out on
one thing, the whole integration will crash and burn. And it's quite
easy to miss a thing... The question is whether there is a generic
pattern, like the "health" target I sketched above, so that SWUpdate
can handle and abstract the bootloader interactions? Then any SWUpdate
mode/module would behave the same, and it's all in one place, reducing
the need for having all the context knowledge...

> To avoid the issue you are seeing, the decision should be done inside
> SWUpdate: something like a transition TESTING ==> OK, because SWUpdate
> is running.
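The "health" target idea could look like the sketch below - run from an
init/systemd hook once the system is considered up. Everything here is
an assumption for illustration: the check names are placeholders, and
the actual way to close the transaction (e.g. swupdate -c) depends on
the integration.

```shell
#!/bin/sh
# Hypothetical "health" gate: only once all project-specific checks
# pass is the update transaction closed with OK; otherwise ustate stays
# at TESTING so the normal rollback path can kick in.

system_healthy() {
    # Project-specific checks would go here, e.g. (placeholders):
    #   systemctl is-active --quiet my-application.service || return 1
    #   my-db-migration --verify || return 1
    return 0  # simulated: pretend all checks passed
}

if system_healthy; then
    # Close the transaction: TESTING -> OK.
    echo "healthy: close transaction with OK (e.g. swupdate -c)"
else
    # Leave ustate at TESTING; the next boot can fall back.
    echo "not healthy: keep TESTING and let rollback happen"
fi
```

The point of the thread is precisely that this gate is today left to
each integrator's convention rather than handled generically.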
> But as I said, this can only be done if it is configurable, or it
> will break the use cases I mentioned.

This is essentially promoting the current suricatta behavior to all
SWUpdate modes/modules, w/o the remote reporting part if not run from a
suricatta module. Would be a starter...

Kind regards,
   Christian

-- 
Dr. Christian Storm
Siemens AG, Technology, T CED SES-DE
Otto-Hahn-Ring 6, 81739 München, Germany

-- 
You received this message because you are subscribed to the Google
Groups "EFI Boot Guard" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/efibootguard-dev/20230221212125.zdic7daabaa25ovk%40cosmos.
