On Wed Jun 24, 2026 at 3:36 PM JST, Eliot Courtney wrote: > On Wed Jun 24, 2026 at 1:30 PM JST, Alexandre Courbot wrote: >> On Tue Jun 23, 2026 at 2:40 PM JST, Eliot Courtney wrote: >>> On Tue Jun 23, 2026 at 12:04 AM JST, Alexandre Courbot wrote: >>>> On Mon Jun 22, 2026 at 4:57 PM JST, Eliot Courtney wrote: >>>>> On Fri Jun 19, 2026 at 10:42 PM JST, Alexandre Courbot wrote: >>>>>> The next patch aims at replacing the cumbersome `BootUnloadGuard` with a >>>>>> more local and less intrusive mechanism to run the GSP unload sequence >>>>>> upon GSP boot failure. Doing so requires running the boot code in a >>>>>> local closure, which changes its indentation and would make other >>>>>> changes difficult to track in the diff. Thus, this preparatory patch >>>>>> moves said boot code into a local closure that is run upon construction, >>>>>> so the next patch does not need to re-indent code that changes. >>>>>> >>>>>> This is a mechanical preparatory patch to make the next patch easier to >>>>>> read. No functional change intended. >>>>>> >>>>>> Signed-off-by: Alexandre Courbot <[email protected]> >>>>>> --- >>>>> >>>>> I agree with removing BootUnloadGuard, but I think it's not great to do >>>>> a bunch of lifting into closures then manually handling the result. It's >>>>> error prone imo (we already had several bugs relating to this kind of >>>>> thing). Instead, what about just using ScopeGuard directly? This lets us >>>>> avoid lifting into closures (which is a bit noisy) and avoids manual >>>>> result handling for failures (which is a bit error prone). With the >>>>> `GspBootContext` it's fairly easy to do now: >>>>> >>>>> ``` >>>>> let unload_guard = ScopeGuard::new_with_data(unload_bundle, >>>>> |unload_bundle| { >>>>> let _ = gsp.unload(ctx, unload_bundle); >>>>> }); >>>>> ``` >>>> >>>> Yes, initially I wanted to use `ScopeGuard` but ran into issues when >>>> trying to make the references mutable. However it looks like you have >>>> been able to overcome these issues, thanks for taking the time to craft >>>> a patch! >>>> >>>>> >>>>> I confirmed that it's also compatible with the v2 of this series that >>>>> has the mutable Fsp - you can stash the context inside the ScopeGuard >>>>> data (then making a &mut reference to the stashed context for brevity) >>>>> or have a separate unload context type that doesn't use FSP or something >>>>> (could later be type parametrized along with Gsp, for example). >>>>> >>>>> For example here is a rough diff on top of this patch series (you can >>>>> change the Result<Option<UnloadBundle>> returns to like >>>>> Result<Result<UnloadBundle>> if you want to centralise teh error >>>>> handling of a failed unloadbundle although currently it can only fail in >>>>> one location): >>>> >>>> Yes, looking at it it looks like a cleaner approach than using closures. >>>> >>>> The only thing that I saw as a regression is that now each HAL needs to >>>> call `Gsp::unload` itself in its own `ScopeGuard`. I don't think that's >>>> the HAL's work - `Gsp::boot` should be the centralized point where this >>>> happens. >>>> >>>> But we can make both work if `unload_bundle` is passed as an output >>>> argument to `GspHal::boot` instead of being returned: >>>> >>>> let mut guard = ScopeGuard::new_with_data((ctx, None), ...); >>>> let (ctx, unload_bundle) = &mut *guard; >>>> >>>> // `unload_bundle` is a mutable reference to the >>>> // `Option<UnloadBundle>` in `guard`. >>>> hal.boot(&self, ctx, unload_bundle, &fb_layout, &wpr_meta)?; >>>> >>>> The boot method can fill `unload_bundle` early, and if it returns an >>>> error then the `ScopeGuard` will be able to use it. Also, and that's >>>> nice, the HALs don't need to use `ScopeGuard`. But output arguments >>>> aren't really something we expect to see in Rust. >>>> >>>> Another alternative would be to separate the unload bundle construction >>>> from `GspHal::boot`: >>>> >>>> let unload_bundle = hal.build_unload_bundle(ctx, ...); >>>> let guard = ScopeGuard::new_with_data((ctx, unload_bundle), ...); >>>> >>>> hal.boot(...)?; >>>> >>>> It removes the "1 boot -> 1 unload bundle" symmetry, but on the other >>>> hand it also splits concerns more clearly. And it removes the awkward >>>> return type of `GspHal::boot`, which come to think of it was another >>>> smell that things were not in the right place. >>>> >>>> The main drawback I see is that we now need to build `Vbios` twice for >>>> TU102, since it is needed both for the unload bundle and for booting. >>>> But the solution is the same as what the v2 of this series does to >>>> `Fsp`: the BIOS is a GPU-wide subdevice that is likely to be used >>>> elsewhere, not something to be confined in a GSP HAL. So I say, let's >>>> extract it and make it also part of `GspBootContext`. >>>> >>>> How does that sound? >>> >>> I think that currently it's confusing because we have two concepts in >>> use with very similar names. We have "unloadbundle" meaning what we need >>> to run to unload the driver, and we have "unload_guard" which is for >>> running unwind stuff if we error. And for the most part these things are >>> the same, but they might not be (e.g. in my other series where we need >>> to keep certain DMA allocations alive just for the error path, but not >>> for an unload later, or when we are partially loaded). So it might be >>> nice to make these two things more separate. >>> >>> I think that trying to build the unload bundle separately >>> (`build_unload_bundle`) is confusing because e.g. hal.boot() still needs >>> to handle unwinding its state in case of error, so it adds a strong >>> assumption that unloading in success is the same as unwinding in error. >> >> I don't think `hal.boot()` can unwind its state properly - or I should >> say, I don't think it can unwind it better than just running its unload >> bundle itself. >> >> There is little point calling `Gsp::unload` from a HAL failure as it >> just queues an `UnloadGuestDriver` message (that the GSP wouldn't be in >> condition to process) before running the unload bundle. So here we can >> simply skip the guaranteed message timeout and run the unload bundle (or >> the relevant part) directly. But even doing so doesn't guarantee that we >> can recover. >> >> For TU102, the unload bundle runs two firmwares to reset the GSP falcon >> to its original state and to remove the WPR2 region. I don't think we >> can/should run them selectively, as they don't undo work in the same >> order as the boot sequence - or even from the same microcontroller! For >> instance, FWSEC-FRTS (GSP falcon) sets up the WPR2 region, but it is >> tore down by Booter Unload (SEC2 falcon). So the best we can do if >> something fails on the boot path is run the whole unload bundle and hope >> that we can recover. >> >> For GH100, the unload bundle just waits for the GSP falcon to get out of >> RISC-V mode, as a consequence of the `UnloadingGuestDriver` GSP command >> sent by `Gsp::unload`. The problem is, that in case of failure that >> command cannot even be sent as the GSP is not up! :) So the unload >> bundle will either succeed immediately (if we failed early) or timeout >> (if the GSP is already in RISC-V mode and stuck somewhere else). It's >> actually probably better to not even try to run it in this case, and >> just wait for the GSP lockdown release to try and address the DMA buffer >> lifetime you raised. >> >> Which supports your case that the HALs should be be responsible for >> their own unwinding - only not by running `Gsp::unload` IMHO. >> >> In any case, there doesn't seem to be a recovery path that we can >> execute reliably in case of GSP boot failure. So making the HALs >> responsible for recovering does indeed sound like the safest solution >> (with TU102 trying to run the unload bundle, and GH100 doing... I'm not >> sure what :)) and should address the issues you raised - but please let >> me know if I missed something. > > Ok I think you are right w.r.t. it might be simpler to not have > significantly staged unwinding logic in the HALs. That said, IIUC, in > principle, it could unwind a bit better than the unload bundle. For > example, if we fail before we send the CoT message, then we don't need > to do anything to reset. If we fail after a CoT response, then we can > wait for GSP to halt and then try a falcon level reset if that fails. If > we fail sending the CoT message, then perhaps it needs a FLR to be sure > that things are properly reset - I think these are some options for > trying to ensure reset on GH100+. > > I think we could centralise a lot of this reset logic, e.g. overall it > would be send guest unload, wait for halt, then start trying various > resets if GSP doesn't halt. This could be some HAL methods I suppose, > and Gsp::unload just deals with running the completely common Gsp guest > unload + calling the HAL bundle. > > So that fits with keeping Gsp::unload out of HAL modules, but also > keeping the ScopeGuard run unload bundle in HAL boot() etc so we can > allow HALS to e.g. keep resources alive until GSP is definitely reset. > Then in a follow up we can harden the reset process to handle more > cases. > > WDYT?
Sounds good, v3 will do the following, which I think aligns with the result of the discussion: - Anything error happening in `GspHal::boot` is the responsibility of the HAL to handle, - After `GspHal::boot` has returned, `Gsp::unload` can be called safely so `Gsp::boot` does that using a `ScopeGuard`, - No closures. :) For now the error handling that HALs do is rather primitive (TU102 runs its unload bundle, GH100 waits for the GSP lockdown release) and I am not sure that there is room for improvement, but if there is the layout should let us address it. FLR can also eventually be handled at a higher layer if need be.
