On Thu, 2025-12-04 at 23:43 +0900, Alexandre Courbot wrote:
> And yeah, you are abolutely right, but my point was more about not
> having the code to handle panic conditions generated. Maybe I am
> thinking too much ahead, but I dream of a future where we could make
> guarantees like "this function never panics" and have the compiler
> complain if it does. So as a matter of principle I like to avoid having
> these, especially when they cannot happen in practice.

I can totally agree with that, but I don't know how to tell the compiler that 
obj.size() will never
be close to 2^64.

The issue only occurs with RMARGS, the last entry in the LibosArgs array.  That 
is a 72-byte struct,
IIRC.  All of the other entries in the array are logging buffers that are 
already 4K aligned.

> So something like using `pr_warn` looks reasonable to me as a last
> resort.
> 
> ... or maybe we can address the problem differently. Reading your commit
> log again:
> 
>   GSP-RM insists that the 'size' parameter of the
>   LibosMemoryRegionInitArgument struct be aligned to 4KB.      
> 
> sounds to me like "it is a bug if `size` is not aligned to 4KB to begin
> with". Could that be a correct interpretation?

Short answer: yes.  I had to dig real deep into the firmware to find this 
restriction.  GSP-RM has
this check:

    #define LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL      0x1000ULL

    INIT_PANIC_IF((init_args->size & (LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL 
- 1)) != 0,
                  LibosPanicReasonInvalidConfiguration, init_args->size,
                  "Unaligned size specified");

For some reason, this code exists only in the Turing/GA100 version of GSP-RM 
(aka Libos V2).  The
GA102+ version of GSP-RM (Libos V3) has completely different code for parsing 
the LibosArgs table
that doesn't check the size for alignment.

> Because if we align up past the valid data of the object, then what are
> we copying? Granted, `CoherentAllocation` will likely have an aligned
> size, but that's a lucky implementation detail. So maybe we can just
> downright return an error if the size is not aligned, which would solve
> the panic problem.
> 
> Or we fix the problem when allocating the `CoherentAllocation`, making
> sure the filler data exists and is zeroes, and providing a valid `size`
> from the beginning.

Yes, that is better.  In fact, now that I think about it, rounding up the size 
was a work-around
that I implemented and then forgot about.  The correct approach is to do what 
Nouveau does and
allocate 4K for RMARGS:

                ret = nvkm_gsp_mem_ctor(gsp, 0x1000, &gsp->rmargs);

Unfortunately, I don't know how to get Nova to allocate a larger block of 
memory for an object than
its size.  The RMARGS struct is defined in bindings.rs:

pub struct GSP_ARGUMENTS_CACHED {
    pub messageQueueInitArguments: MESSAGE_QUEUE_INIT_ARGUMENTS,
    pub srInitArguments: GSP_SR_INIT_ARGUMENTS,
    pub gpuInstance: u32_,
    pub bDmemStack: u8_,
    pub __bindgen_padding_0: [u8; 7usize],
    pub profilerArgs: GSP_ARGUMENTS_CACHED__bindgen_ty_1,
}

which I believe is 72 bytes.

It is allocated here:

        let rmargs = CoherentAllocation::<GspArgumentsCached>::alloc_coherent(
            dev,
            1,
            GFP_KERNEL | __GFP_ZERO,
        )?;


The "1" is a count field.  Unfortunately, 4096/72 is not an even number, so I 
can't just change
`count` to another number.

If you can tell me how to fix the CoherentAllocation allocation call so that it 
allocates 4K, that
would be a better fix for this problem.  It's only necessary for Turing/GA100, 
however.

Reply via email to