Re: [PATCH v3 3/7] livepatch: Support scoped atomic replace using replace_set

Yafang Shao Thu, 18 Jun 2026 02:20:50 -0700

On Wed, Jun 17, 2026 at 10:55 PM Joe Lawrence <[email protected]> wrote:
>
> On Wed, Jun 17, 2026 at 10:40:50AM +0800, Yafang Shao wrote:
> > On Wed, Jun 17, 2026 at 4:15 AM Joe Lawrence <[email protected]> wrote:
> > >
> > > On Thu, Jun 11, 2026 at 02:58:39PM +0200, Petr Mladek wrote:
> > > > On Tue 2026-06-09 18:00:55, Petr Mladek wrote:
> > > > > On Sun 2026-06-07 21:16:55, Yafang Shao wrote:
> > > > > I would write something like:
> > > > >
> > > > > <proposal>
> > > > > > The practice shows that the current semantic of the patch.replace
> > > > > > flag is not ideal.
> > > > >
> > > > > > The atomic replace is disabled by default. And the no-replace mode
> > > > > > allows wild installation of many livepatches in parallel. The author
> > > > > > and administrator are fully responsible for preventing problems caused
> > > > > > by producing and installing incompatible livepatches.
> > > > >
> > > > > The most safe atomic replace mode must be explicitly enabled by
> > > > > setting "patch.replace = true". It is all or nothing. The livepatch
> > > > > with enabled .replace will always replace all already installed
> > > > > livepatches. It makes it very safe but it might be too harsh.
> > > > >
> > > > > Improve the situation by switching "bool .replace" flag to
> > > > > > "u32 .replace_set" and updating its semantic.
> > > > >
> > > > > Any .replace_set value might be associated with a set of livepatched
> > > > > symbols, callbacks, shadow variable and state IDs.
> > > > >
> > > > > A livepatch with a particular .replace_set number will atomically
> > > > > > replace any already installed livepatch with the same .replace_set
> > > > > number. By definition, there can only ever be one active livepatch
> > > > > for a given replace_set number.
> > > > >
> > > > > On the contrary, livepatches with a different .replace_set number
> > > > > must not modify the same function, or use the state with the same
> > > > > ID [*]. Any attempt to load an incompatible livepatch will be
> > > > > rejected.
> > > > >
> > > > > Summary:
> > > > >
> > > > > The most safe mode when any livepatch replaces any other livepatch
> > > > > will be the default. Note that all livepatches must keep
> > > > > .replace_set = 0.
> > > > >
> > > > > It will be possible to install more livepatches in parallel by
> > > > > using different .replace_set numbers. The livepatches might be
> > > > > updated independently using the atomic replace feature as long
> > > > > as the new version does not break compatibility. The kernel will
> > > > > reject a livepatch from a different replace set when it would
> > > > > want to modify the same function or livepatch state from
> > > > > another replace set.
> > > > >
> > > > > [*] The compatibility check of callbacks and shadow variables will
> > > > >     be improved later by reworking their semantic. There is a work
> > > > >     in progress, see [0]
> > > > > </proposal>
> > > > >
> > > > > > > Link: https://github.com/pmladek/linux/tree/klp-state-transfer-v1-iter12 [0]
> > > > >
> > > > > I have realized that I actually sent "v1-iter12" to the public
> > > > > mailing list as the official v1. So we could use:
> > > > >
> > > > > > Link: https://lore.kernel.org/all/[email protected]/ [0]
> > > > >
> > > > >
> > > > > New idea:
> > > > >
> > > > > I have briefly discussed the new semantic with Miroslav when I met
> > > > > him in person. And he was a bit concerned. We as an OS distributor
> > > > > might want to be sure that our livepatches can be installed the most
> > > > > safe way. So, we still might want to preserve the "replace all"
> > > > > semantic to make sure that our livepatches will not break anything.
> > > >
> > > > I thought more about it and we would need some solution to preserve
> > > > the replace_all functionality.
> > > >
> > > > > A few serious 0-day vulnerabilities were reported recently.
> > > > > We discussed a possibility to ship a quick fix with a livepatch.
> > > > > Or that customers might want to fix it themselves by a livepatch.
> > > > Such a livepatch would need to be installed in parallel to
> > > > the official livepatch fixing older bugs. But the next official
> > > > cumulative livepatch would need to replace it.
> > > >
> > > > > The above scenario will no longer work with the current
> > > > "replace_set" handling. The hotfix would need to use another
> > > > "replace_set" so that it can be installed in parallel.
> > > > But the next cumulative livepatch won't be able to replace
> > > > it because it would need to modify the same function.
> > > >
> > > > > I consulted this with AI (claude-sonnet-4.6) and it gave the following
> > > > feedback/ideas ;-)
> > > >
> > > > > > I thought about 4 approaches:
> > > > >
> > > > > 1. Make .replace_set=0 special so that it will always replace
> > > > >    everything. Similar to the current .replace=true mode.
> > > > >
> > > > >    Customers will still be able to install custom livepatches
> > > > >    later with .replace_set != 0. But the "0" livepatch will
> > > > >    always wipe them out.
> > > >
> > > > > This is not ideal because it is asymmetric. Why is "0" special?
> > > >
> > >
> > > Hah, why is zero special?  Because we said so and the asymmetry is the
> > > point. :)  On my first pass through this patchset and reply chain, I'd
> > > say I lean toward approach (1) as it's dead simple and means not
> > > participating in replace_set values = no functional changes for the
> > > former atomic-replace user ...
> >
> > Making zero a special case might reintroduce the issues with
> > cumulative and non-cumulative patches. See the detailed example below.
> >
> > >
> > > >
> > > > > 2. Use two flags in the livepatch, for example
> > > > >
> > > > >      a. Rename .replace to .replace_all. The livepatch with this
> > > > >     flag set will always wipe all other livepatches.
> > > > >
> > > > >      b. Add .replace_set which will allow to install more livepatches
> > > > >     in parallel, replace the livepatches with the same .replace_set
> > > > >     atomically, and check the compatibility. As described above.
> > > > >
> > > > >     It is a bit more complicated. But it is more compatible with
> > > > >     the current state. And it removes the special meaning of
> > > > >     .replace_set == 0.
> > > >
> > > > This looks more straightforward. But the fact that "replace_all"
> > > > replaces everything brings back the problem with the original
> > > > "replace" flag. So, it makes this whole exercise more or less
> > > > pointless.
> > > >
> > > > I had another idea with storing list of fixed bugs/CVEs in each
> > > > livepatch. Independent fixes might be fixed by independent
> > > > livepatches. Then a cumulative livepatch would replace only
> > > > the livepatches which fixed the same bugs before.
> > > >
> > > > And (claude-sonnet-4.6) came with an interesting simplification.
> > > >
> > > > We could add:
> > > >
> > > > struct klp_patch {
> > > > [...]
> > > >       unsigned int replace_set;
> > > > >       const unsigned int *supersedes;   /* Zero terminated array of replace_set IDs */
> > > > [...]
> > > > }
> > > >
> > > > > So that the cumulative livepatch might optionally define
> > > > > other "replace_set"s which would be replaced.
> > > >
> > > > This would work well when both cumulative livepatches and the hotfix
> > > > are provided by the same vendor or group.
> > > >
> > > > > We could also allow changing it dynamically by adding a module
> > > > > option to the cumulative livepatch, e.g. supersedes=id[,id]*
> > > > > We could add some support into the kernel for handling the module
> > > > > parameter in a standard way.
> > > >
> > > > It is not trivial. But it is also not horribly complex.
> > > > It looks like a good compromise between the requirements and
> > > > code complexity.
> > > >
> > > > We really need input from others here.
> > > >
> > >
> > > I'm not against supersedes functionality, but continuing the
> > > brainstorming: what about solution 1 (.replace_set=0 special) with a
> > > special zero-day overlay?
> > >
> > > The model becomes:
> > >
> > > - replace_set: isolation sets (as Yafang has implemented)
> > > - overlay (bool): "I'm a partial addition to my set, not a full replacement"
> > >
> > > and then the vendor zero-day scenario looks like:
> > >
> > >   Mon: cumulative patch (set 0, overlay=false)
> > >   Tue: hotfix (set 0, overlay=true) stacks on top, overrides one function
> >
> > At this point, if the user reboots the machine, the loading order of
> > these livepatches becomes undetermined. If the hotfix is loaded first,
> > the cumulative patch loaded next will replace it. As a result, the
> > user must maintain the load order of these livepatches, which can be
> > quite painful.
> >
>
> [ Edit: Feel free to jump to the bottom.  Having been on PTO + holiday,
>   I needed to run through the full thought experiment below to come to
>   the same conclusion as Petr and Yafang.  Leaving it here in case it
>   helps anyone else trace through the problem. ]
>
> Ah right, this overlay idea drives us right back into stack_order
> headache.
>
> Now to take the same scenario to the supersede feature.  If I understand
> correctly, the idea is that cumulative vendor patches roll out at some
> interval (weekly, monthly, etc.) and they live in set=0.  Emergency CVE
> firedrill ensues and to expedite the fix, the vendor skips the long
> cumulative build/QE/etc. with a targeted hotfix that lives in set=1.  A
> while later, the vendor releases a full cumulative update, set=0 and
> supersedes=1 to replace the temporary hotfix(es):
>
>   Mon: cumulative base patch                      (set=0)
>   Tue: hotfix v1 disable vulnerable code          (set=1)
>   Wed: hotfix v2 vendor-specific attempt to solve (set=1)
>   Thu: cumulative patch with final CVE fix        (set=0, supersedes=1)
>
> If Wednesday's hotfix v2 removes hotfix v1 from disk, then rebooting
> before Thursday's cumulative is safe regardless of load order, as set=0
> and set=1 coexist independently.  If stale hotfix versions remain on
> disk however, same-set replacement within set=1 means the last one
> loaded wins, which may silently downgrade the fix.  With that packaging
> detail in place, this scenario looks good.
>
> One additional note is that supersedes is only a mechanism to replace
> the hotfixes.  Hotfixes still adhere to replace set rules, so if a 0-day
> lands in a function the current cumulative already patches, the hotfix
> can't be deployed as a separate set at all.  The vendor is forced to
> rebuild the full cumulative.
>
> > >   Wed: new cumulative (set 0, overlay=false) replaces both
> > >
> > > If overlay patches are cumulative, then it should support iterating on
> > > zero-day fixes like:
> > >
> > >   Mon: cumulative base patch
> > >   Tue: hotfix v1 disable vulnerable code
> > >   Wed: hotfix v2 vendor-specific attempt to solve
> > >   Thu: cumulative patch with final CVE fix
> > >
> > > So I think either the supersedes or overlay feature handles vendor-only
> > > scenarios well.
> > >
> > > The big difference overlay has from supersedes is that it intentionally
> > > only plays within the vendor replace-set space.  So if a (customer)
> > > feature replace-set was off touching function_foo() and a CVE landed
> > > there, the overlay feature would remain blocked from patching it.
> > > Supersedes provides a big hammer here.
> > >
> > > That said, blind eviction via supersede assumes the customer's
> > > replace-set patches are actually safe to bounce.  The customer's patch
> > > may have allocated shadow variables, modified system state via
> > > callbacks, or changed data structure semantics, all designed to be
> > > unwound by the next customer version of that patch, not by an unrelated
> > > vendor patch.  The vendor can't know what semantic landmines the
> > > customer's patch left behind, and the kernel can't validate that at load
> > > time.
> > >
>
> Now it gets complicated and I've got a fresh cup of tea to consider how
> supersede coexists with user livepatch replace sets.
>
> Scenario 1: 0-day in foo(), nobody patches it yet:
>
>   Mon: cumulative base patch                      (set=0)
>        customer feature patch                     (set=2)
>   Tue: hotfix v1 disable vulnerable code          (set=1)
>   Wed: hotfix v2 vendor-specific attempt to solve (set=1)
>   Thu: cumulative patch with final CVE fix        (set=0, supersedes=1)
>
> Life is great, the customer feature livepatch sits out in set=2,
> unaffected by all the vendor cumulative (set=0) and hotfix (set=1)
> churn.
>
> Scenario 2: 0-day in bar(), customer's set owns it:
>
>   Mon: cumulative base patch                      (set=0)
>        customer feature patch                     (set=2)
>   Tue: hotfix v1 patches bar()                    (set=1)  < REJECTED, bar() owned by set=2
>
> The kernel rejects the vendor's hotfix because the customer set=2 owns
> bar() and it doesn't supersede it.  <Ahah moment> if supersedes is
> provided as a vendor livepatch module parameter, the educated customer
> could then choose the hammer and let their vendor's livepatch replace
> their bar().
>
> [ Edit: Joe is finally up to speed here. ]
>
> Alright I think I'm onboard with the supersede feature and its optional
> module parameter.
>
> With that, is it worth documenting a convention for replace_set
> allocation?  Something as simple as "vendors use low-numbered sets,
> customers use higher ones" might help avoid collisions, with the
> understanding that the kernel makes no distinction between them.


Since replace_sets can be configured dynamically as needed, I don't
think we need to document it explicitly. Users can simply choose the
right sets for their production environment—for example, by checking
the currently used sets and adjusting accordingly.
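
To make the rules discussed above concrete, here is a minimal userspace
sketch in C. It is not kernel code and not the implementation from this
patchset: struct model_patch and the model_*() helpers are hypothetical
names, and the supersedes array is assumed to be zero-terminated as in
the struct proposal quoted above, which means set 0 itself cannot be
listed in it. In this model a new patch replaces installed patches from
its own replace_set and from any set it supersedes, and it is rejected
when it shares a function with a patch that would stay installed.

```c
/* Userspace model only. All names below are hypothetical. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct model_patch {
	unsigned int replace_set;
	const unsigned int *supersedes;	/* zero-terminated, may be NULL */
	const char *const *funcs;	/* NULL-terminated patched functions */
};

/* Does @new list @set in its supersedes array? */
static bool model_supersedes(const struct model_patch *new, unsigned int set)
{
	const unsigned int *id;

	if (!new->supersedes)
		return false;
	for (id = new->supersedes; *id; id++)
		if (*id == set)
			return true;
	return false;
}

/* Do @a and @b patch at least one function with the same name? */
static bool model_shares_func(const struct model_patch *a,
			      const struct model_patch *b)
{
	const char *const *fa, *const *fb;

	for (fa = a->funcs; *fa; fa++)
		for (fb = b->funcs; *fb; fb++)
			if (!strcmp(*fa, *fb))
				return true;
	return false;
}

/*
 * Returns true when @new may be loaded. replaced[i] is set for every
 * installed patch that @new would replace (same set or superseded set).
 * Loading fails when @new overlaps with a patch that would remain.
 */
static bool model_can_load(const struct model_patch *new,
			   const struct model_patch *const *installed,
			   size_t n, bool *replaced)
{
	bool ok = true;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct model_patch *old = installed[i];

		replaced[i] = old->replace_set == new->replace_set ||
			      model_supersedes(new, old->replace_set);
		if (!replaced[i] && model_shares_func(new, old))
			ok = false;
	}
	return ok;
}
```

With a set=0 cumulative patch on foo() and a set=2 customer patch on
bar() installed, a set=1 hotfix on bar() is rejected by this model, and
it is accepted once it lists 2 in supersedes.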

-- 
Regards
Yafang
