Re: [PATCH 11/11] gpu: nova-core: add PIO support for loading firmware images

Alexandre Courbot Wed, 19 Nov 2025 05:49:58 -0800

On Wed Nov 19, 2025 at 1:28 PM JST, Alexandre Courbot wrote:
> On Sat Nov 15, 2025 at 8:30 AM JST, Timur Tabi wrote:
>> Turing and GA100 use programmed I/O (PIO) instead of DMA to upload
>> firmware images into Falcon memory.
>>
>> A new firmware called the Generic Bootloader (as opposed to the
>> GSP Bootloader) is used to upload FWSEC.
>>
>> Signed-off-by: Timur Tabi <[email protected]>
>> ---
>>  drivers/gpu/nova-core/falcon.rs         | 181 ++++++++++++++++++++++++
>>  drivers/gpu/nova-core/firmware.rs       |   4 +-
>>  drivers/gpu/nova-core/firmware/fwsec.rs | 112 ++++++++++++++-
>>  drivers/gpu/nova-core/gsp/boot.rs       |  10 +-
>>  4 files changed, 299 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/nova-core/falcon.rs b/drivers/gpu/nova-core/falcon.rs
>> index 7af32f65ba5f..f9a4a35b7569 100644
>> --- a/drivers/gpu/nova-core/falcon.rs
>> +++ b/drivers/gpu/nova-core/falcon.rs
>> @@ -20,6 +20,10 @@
>>  use crate::{
>>      dma::DmaObject,
>>      driver::Bar0,
>> +    firmware::fwsec::{
>> +        BootloaderDmemDescV2,
>> +        GenericBootloader, //
>> +    },
>>      gpu::Chipset,
>>      num::{
>>          FromSafeCast,
>> @@ -400,6 +404,183 @@ pub(crate) fn reset(&self, bar: &Bar0) -> Result {
>>          Ok(())
>>      }
>>  
>> +
>> +    /// See nvkm_falcon_pio_wr - takes a byte array instead of a FalconFirmware
>> +    fn pio_wr_bytes(
>> +        &self,
>> +        bar: &Bar0,
>> +        source: *const u8,
>> +        mem_base: u16,
>> +        length: usize,
>
> We will definitely want to combine `source` and `length` into a
> convenient `&[u8]`. Now I understand why you used a pointer here,
> because we need to write an instance of `BootloaderDmemDescV2`, and also
> because we use data from a `CoherentAllocation`.
>
> The first one is easy to fix: `BootloaderDmemDescV2` is just a bunch of
> integers, so you can implement `AsBytes` on it and get a nice slice of
> bytes exactly as we want.
>
>> +        target_mem: FalconMem,
>> +        port: u8,
>> +        tag: u16
>> +    ) -> Result {
>> +        // To avoid unnecessary complication in the write loop, make sure the buffer
>> +        // length is aligned.  It always is, which is why an assertion is okay.
>> +        assert!((length % 4) == 0);
>
> Let's return an error instead of panicking here.
>
>> +
>> +        // From now on, we treat the data as an array of u32
>> +
>> +        let length = length / 4;
>> +        let mut remaining_len: usize = length;
>> +        let mut img_offset: usize = 0;
>> +        let mut tag = tag;
>> +
>> +        // Get data as a slice of u32s
>> +        let img = unsafe {
>> +            core::slice::from_raw_parts(source as *const u32, length)
>> +        };
>> +
>> +        match target_mem {
>> +            FalconMem::ImemSec | FalconMem::ImemNs => {
>> +                regs::NV_PFALCON_FALCON_IMEMC::default()
>> +                    .set_secure(target_mem == FalconMem::ImemSec)
>> +                    .set_aincw(true)
>> +                    .set_offs(mem_base)
>> +                    .write(bar, &E::ID, port as usize);
>> +            },
>> +            FalconMem::Dmem => {
>> +                // gm200_flcn_pio_dmem_wr_init
>
> Probably a stray development-time comment.
>
>> +                regs::NV_PFALCON_FALCON_DMEMC::default()
>> +                    .set_aincw(true)
>> +                    .set_offs(mem_base)
>> +                    .write(bar, &E::ID, port as usize);
>> +            },
>> +        }
>> +
>> +        while remaining_len > 0 {
>> +            let xfer_len = core::cmp::min(remaining_len, 256 / 4); // pio->max = 256
>> +
>> +            // Perform the PIO write for the next 256 bytes.  Each tag represents
>> +            // a 256-byte block in IMEM/DMEM.
>> +            let mut len = xfer_len;
>> +
>> +            match target_mem {
>> +                FalconMem::ImemSec | FalconMem::ImemNs => {
>> +                    regs::NV_PFALCON_FALCON_IMEMT::default()
>> +                        .set_tag(tag)
>> +                        .write(bar, &E::ID, port as usize);
>> +
>> +                    while len > 0 {
>> +                        regs::NV_PFALCON_FALCON_IMEMD::default()
>> +                            .set_data(img[img_offset])
>> +                            .write(bar, &E::ID, port as usize);
>> +                        img_offset += 1;
>> +                        len -= 1;
>> +                    };
>> +
>> +                    tag += 1;
>> +                },
>> +                FalconMem::Dmem => {
>> +                    // tag is ignored for DMEM
>> +                    while len > 0 {
>> +                        regs::NV_PFALCON_FALCON_DMEMD::default()
>> +                            .set_data(img[img_offset])
>> +                            .write(bar, &E::ID, port as usize);
>> +                        img_offset += 1;
>> +                        len -= 1;
>> +                    };
>> +                },
>> +            }
>> +
>> +            remaining_len -= xfer_len;
>> +        }
>
> Let's turn this C-style loop into something more Rust-like.
>
> We want to divide the input twice: once into 256-byte blocks to write the
> IMEM tag if needed, and then again into `u32` words. Nova being
> little-endian, we can assume that ordering. This lets us leverage
> `chunks` and `from_bytes`.
>
> I got the following (untested) code, which assumes `source` is the
> `&[u8]` we want to write:
>
>     // Length of an IMEM tag in bytes.
>     const IMEM_TAG_LEN: usize = 256;
>
>     for chunk in source.chunks(IMEM_TAG_LEN) {
>         // Convert our chunk of bytes into an array of u32s.
>         //
>         // This can never fail as the sizes match, but propagate the error
>         // to avoid an `unsafe` statement.
>         let chunk = <[u32; IMEM_TAG_LEN / size_of::<u32>()]>::from_bytes(chunk)?;


Wait, that will fail on the last chunk unless the input size is a
multiple of 256. But you can replace that last line with

    let chunk = chunk.chunks_exact(size_of::<u32>()).map(|word| {
        // This `unwrap` cannot fail because `chunks_exact` guarantees that `word` is the
        // size of a `u32`.
        let word: [u8; 4] = word.try_into().unwrap();
        u32::from_le_bytes(word)
    });

and it should be good.

You'll also need to change `for &data in chunk` into `for data in chunk`
in the code that follows.
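
To make the two-level split concrete, here is a minimal standalone sketch
of the idea in plain Rust, outside the kernel. It is not the nova-core
code: `Write`, `InvalidLength` and `pio_words` are hypothetical names made
up for illustration, and the register writes are replaced by pushing onto
a `Vec` so the ordering can be inspected. It also returns an error for a
length that is not `u32`-aligned instead of asserting.

```rust
use core::convert::TryInto;
use core::mem::size_of;

/// Length of an IMEM tag in bytes.
const IMEM_TAG_LEN: usize = 256;

/// Hypothetical stand-in for the register writes: records what would be
/// written instead of touching hardware.
#[derive(Debug, PartialEq)]
enum Write {
    Tag(u16),
    Data(u32),
}

/// Hypothetical error type, standing in for whatever error the driver
/// would return.
#[derive(Debug, PartialEq)]
struct InvalidLength;

/// Splits `source` into 256-byte blocks, emits one tag per block, then
/// emits each little-endian `u32` word of the block.
fn pio_words(source: &[u8], first_tag: u16) -> Result<Vec<Write>, InvalidLength> {
    // Return an error instead of panicking on an unaligned length.
    if source.len() % size_of::<u32>() != 0 {
        return Err(InvalidLength);
    }

    let mut out = Vec::new();
    let mut tag = first_tag;

    // `chunks` (not `chunks_exact`) so that a short final block is kept.
    for block in source.chunks(IMEM_TAG_LEN) {
        out.push(Write::Tag(tag));
        tag += 1;

        let words = block.chunks_exact(size_of::<u32>()).map(|word| {
            // Cannot fail: `chunks_exact` only yields 4-byte slices here.
            let word: [u8; 4] = word.try_into().unwrap();
            u32::from_le_bytes(word)
        });

        for data in words {
            out.push(Write::Data(data));
        }
    }

    Ok(out)
}
```

With this sketch, a 260-byte input yields two tags (one for the full
block, one for the 4-byte remainder) and 65 data words, which is the
last-chunk case the fixed-size array conversion would have rejected.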
