Re: [PATCH 3/6] gpu: nova-core: gsp: move boot code into local closure

Alexandre Courbot Mon, 29 Jun 2026 00:49:41 -0700

On Wed Jun 24, 2026 at 3:36 PM JST, Eliot Courtney wrote:
> On Wed Jun 24, 2026 at 1:30 PM JST, Alexandre Courbot wrote:
>> On Tue Jun 23, 2026 at 2:40 PM JST, Eliot Courtney wrote:
>>> On Tue Jun 23, 2026 at 12:04 AM JST, Alexandre Courbot wrote:
>>>> On Mon Jun 22, 2026 at 4:57 PM JST, Eliot Courtney wrote:
>>>>> On Fri Jun 19, 2026 at 10:42 PM JST, Alexandre Courbot wrote:
>>>>>> The next patch aims at replacing the cumbersome `BootUnloadGuard` with a
>>>>>> more local and less intrusive mechanism to run the GSP unload sequence
>>>>>> upon GSP boot failure. Doing so requires running the boot code in a
>>>>>> local closure, which changes its indentation and would make other
>>>>>> changes difficult to track in the diff. Thus, this preparatory patch
>>>>>> moves said boot code into a local closure that is run upon construction,
>>>>>> so the next patch does not need to re-indent code that changes.
>>>>>>
>>>>>> This is a mechanical preparatory patch to make the next patch easier to
>>>>>> read. No functional change intended.
>>>>>>
>>>>>> Signed-off-by: Alexandre Courbot <[email protected]>
>>>>>> ---
>>>>>
>>>>> I agree with removing BootUnloadGuard, but I think it's not great to do
>>>>> a bunch of lifting into closures then manually handling the result. It's
>>>>> error prone imo (we already had several bugs relating to this kind of
>>>>> thing). Instead, what about just using ScopeGuard directly? This lets us
>>>>> avoid lifting into closures (which is a bit noisy) and avoids manual
>>>>> result handling for failures (which is a bit error prone). With the
>>>>> `GspBootContext` it's fairly easy to do now:
>>>>>
>>>>> ```
>>>>> let unload_guard = ScopeGuard::new_with_data(unload_bundle, 
>>>>> |unload_bundle| {
>>>>>     let _ = gsp.unload(ctx, unload_bundle);
>>>>> });
>>>>> ```
>>>>
>>>> Yes, initially I wanted to use `ScopeGuard` but ran into issues when
>>>> trying to make the references mutable. However it looks like you have
>>>> been able to overcome these issues, thanks for taking the time to craft
>>>> a patch!
>>>>
>>>>>
>>>>> I confirmed that it's also compatible with the v2 of this series that
>>>>> has the mutable Fsp - you can stash the context inside the ScopeGuard
>>>>> data (then making a &mut reference to the stashed context for brevity)
>>>>> or have a separate unload context type that doesn't use FSP or something
>>>>> (could later be type parametrized along with Gsp, for example).
>>>>>
>>>>> For example here is a rough diff on top of this patch series (you can
>>>>> change the Result<Option<UnloadBundle>> returns to like
>>>>> Result<Result<UnloadBundle>> if you want to centralise teh error
>>>>> handling of a failed unloadbundle although currently it can only fail in
>>>>> one location):
>>>>
>>>> Yes, looking at it it looks like a cleaner approach than using closures.
>>>>
>>>> The only thing that I saw as a regression is that now each HAL needs to
>>>> call `Gsp::unload` itself in its own `ScopeGuard`. I don't think that's
>>>> the HAL's work - `Gsp::boot` should be the centralized point where this
>>>> happens.
>>>>
>>>> But we can make both work if `unload_bundle` is passed as an output
>>>> argument to `GspHal::boot` instead of being returned:
>>>>
>>>>     let mut guard = ScopeGuard::new_with_data((ctx, None), ...);
>>>>     let (ctx, unload_bundle) = &mut *guard;
>>>>
>>>>     // `unload_bundle` is a mutable reference to the
>>>>     // `Option<UnloadBundle>` in `guard`.
>>>>     hal.boot(&self, ctx, unload_bundle, &fb_layout, &wpr_meta)?;
>>>>
>>>> The boot method can fill `unload_bundle` early, and if it returns an
>>>> error then the `ScopeGuard` will be able to use it. Also, and that's
>>>> nice, the HALs don't need to use `ScopeGuard`. But output arguments
>>>> aren't really something we expect to see in Rust.
>>>>
>>>> Another alternative would be to separate the unload bundle construction
>>>> from `GspHal::boot`:
>>>>
>>>>     let unload_bundle = hal.build_unload_bundle(ctx, ...);
>>>>     let guard = ScopeGuard::new_with_data((ctx, unload_bundle), ...);
>>>>
>>>>     hal.boot(...)?;
>>>>
>>>> It removes the "1 boot -> 1 unload bundle" symmetry, but on the other
>>>> hand it also splits concerns more clearly. And it removes the awkward
>>>> return type of `GspHal::boot`, which come to think of it was another
>>>> smell that things were not in the right place.
>>>>
>>>> The main drawback I see is that we now need to build `Vbios` twice for
>>>> TU102, since it is needed both for the unload bundle and for booting.
>>>> But the solution is the same as what the v2 of this series does to
>>>> `Fsp`: the BIOS is a GPU-wide subdevice that is likely to be used
>>>> elsewhere, not something to be confined in a GSP HAL. So I say, let's
>>>> extract it and make it also part of `GspBootContext`.
>>>>
>>>> How does that sound?
>>>
>>> I think that currently it's confusing because we have two concepts in
>>> use with very similar names. We have "unloadbundle" meaning what we need
>>> to run to unload the driver, and we have "unload_guard" which is for
>>> running unwind stuff if we error. And for the most part these things are
>>> the same, but they might not be (e.g. in my other series where we need
>>> to keep certain DMA allocations alive just for the error path, but not
>>> for an unload later, or when we are partially loaded). So it might be
>>> nice to make these two things more separate.
>>>
>>> I think that trying to build the unload bundle separately
>>> (`build_unload_bundle`) is confusing because e.g. hal.boot() still needs
>>> to handle unwinding its state in case of error, so it adds a strong
>>> assumption that unloading in success is the same as unwinding in error.
>>
>> I don't think `hal.boot()` can unwind its state properly - or I should
>> say, I don't think it can unwind it better than just running its unload
>> bundle itself.
>>
>> There is little point calling `Gsp::unload` from a HAL failure as it
>> just queues an `UnloadGuestDriver` message (that the GSP wouldn't be in
>> condition to process) before running the unload bundle. So here we can
>> simply skip the guaranteed message timeout and run the unload bundle (or
>> the relevant part) directly. But even doing so doesn't guarantee that we
>> can recover.
>>
>> For TU102, the unload bundle runs two firmwares to reset the GSP falcon
>> to its original state and to remove the WPR2 region. I don't think we
>> can/should run them selectively, as they don't undo work in the same
>> order as the boot sequence - or even from the same microcontroller! For
>> instance, FWSEC-FRTS (GSP falcon) sets up the WPR2 region, but it is
>> tore down by Booter Unload (SEC2 falcon). So the best we can do if
>> something fails on the boot path is run the whole unload bundle and hope
>> that we can recover.
>>
>> For GH100, the unload bundle just waits for the GSP falcon to get out of
>> RISC-V mode, as a consequence of the `UnloadingGuestDriver` GSP command
>> sent by `Gsp::unload`. The problem is, that in case of failure that
>> command cannot even be sent as the GSP is not up! :) So the unload
>> bundle will either succeed immediately (if we failed early) or timeout
>> (if the GSP is already in RISC-V mode and stuck somewhere else). It's
>> actually probably better to not even try to run it in this case, and
>> just wait for the GSP lockdown release to try and address the DMA buffer
>> lifetime you raised.
>>
>> Which supports your case that the HALs should be be responsible for
>> their own unwinding - only not by running `Gsp::unload` IMHO.
>>
>> In any case, there doesn't seem to be a recovery path that we can
>> execute reliably in case of GSP boot failure. So making the HALs
>> responsible for recovering does indeed sound like the safest solution
>> (with TU102 trying to run the unload bundle, and GH100 doing... I'm not
>> sure what :)) and should address the issues you raised - but please let
>> me know if I missed something.
>
> Ok I think you are right w.r.t. it might be simpler to not have
> significantly staged unwinding logic in the HALs. That said, IIUC, in
> principle, it could unwind a bit better than the unload bundle. For
> example, if we fail before we send the CoT message, then we don't need
> to do anything to reset. If we fail after a CoT response, then we can
> wait for GSP to halt and then try a falcon level reset if that fails. If
> we fail sending the CoT message, then perhaps it needs a FLR to be sure
> that things are properly reset - I think these are some options for
> trying to ensure reset on GH100+.
>
> I think we could centralise a lot of this reset logic, e.g. overall it
> would be send guest unload, wait for halt, then start trying various
> resets if GSP doesn't halt. This could be some HAL methods I suppose,
> and Gsp::unload just deals with running the completely common Gsp guest
> unload + calling the HAL bundle.
>
> So that fits with keeping Gsp::unload out of HAL modules, but also
> keeping the ScopeGuard run unload bundle in HAL boot() etc so we can
> allow HALS to e.g. keep resources alive until GSP is definitely reset.
> Then in a follow up we can harden the reset process to handle more
> cases.
>
> WDYT?


Sounds good, v3 will do the following, which I think aligns with the
result of the discussion:

- Anything error happening in `GspHal::boot` is the responsibility of
  the HAL to handle,
- After `GspHal::boot` has returned, `Gsp::unload` can be called safely
  so `Gsp::boot` does that using a `ScopeGuard`,
- No closures. :)

For now the error handling that HALs do is rather primitive (TU102 runs
its unload bundle, GH100 waits for the GSP lockdown release) and I am
not sure that there is room for improvement, but if there is the layout
should let us address it.

FLR can also eventually be handled at a higher layer if need be.

Re: [PATCH 3/6] gpu: nova-core: gsp: move boot code into local closure

Reply via email to