On Thu, 2025-12-04 at 23:43 +0900, Alexandre Courbot wrote:
> And yeah, you are absolutely right, but my point was more about not
> having the code to handle panic conditions generated. Maybe I am
> thinking too much ahead, but I dream of a future where we could make
> guarantees like "this function never panics" and have the compiler
> complain if it does. So as a matter of principle I like to avoid having
> these, especially when they cannot happen in practice.

I can totally agree with that, but I don't know how to tell the compiler
that obj.size() will never be close to 2^64.
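
For what it's worth, a rough sketch of sidestepping that by making the
round-up fallible (this assumes core's u64::checked_next_multiple_of() is
available with the kernel's Rust toolchain; the function name and error
value are placeholders, not actual Nova code):

// Returns an error instead of panicking if rounding up to 4 KiB would
// overflow, so the compiler has no panic path to generate for the math.
fn aligned_size(obj_size: u64) -> Result<u64, &'static str> {
    obj_size
        .checked_next_multiple_of(0x1000)
        .ok_or("object size too large to align to 4 KiB")
}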
The issue only occurs with RMARGS, the last entry in the LibosArgs array.
That is a 72-byte struct, IIRC. All of the other entries in the array are
logging buffers that are already 4K aligned.

> So something like using `pr_warn` looks reasonable to me as a last
> resort.
>
> ... or maybe we can address the problem differently. Reading your commit
> log again:
>
> GSP-RM insists that the 'size' parameter of the
> LibosMemoryRegionInitArgument struct be aligned to 4KB.
>
> sounds to me like "it is a bug if `size` is not aligned to 4KB to begin
> with". Could that be a correct interpretation?

Short answer: yes. I had to dig real deep into the firmware to find this
restriction. GSP-RM has this check:

#define LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL 0x1000ULL

INIT_PANIC_IF((init_args->size & (LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL - 1)) != 0,
              LibosPanicReasonInvalidConfiguration, init_args->size,
              "Unaligned size specified");

For some reason, this code exists only in the Turing/GA100 version of
GSP-RM (aka Libos V2). The GA102+ version of GSP-RM (Libos V3) has
completely different code for parsing the LibosArgs table that doesn't
check the size for alignment.

> Because if we align up past the valid data of the object, then what are
> we copying? Granted, `CoherentAllocation` will likely have an aligned
> size, but that's a lucky implementation detail. So maybe we can just
> downright return an error if the size is not aligned, which would solve
> the panic problem.
>
> Or we fix the problem when allocating the `CoherentAllocation`, making
> sure the filler data exists and is zeroes, and providing a valid `size`
> from the beginning.

Yes, that is better. In fact, now that I think about it, rounding up the
size was a work-around that I implemented and then forgot about. The
correct approach is to do what Nouveau does and allocate 4K for RMARGS:

ret = nvkm_gsp_mem_ctor(gsp, 0x1000, &gsp->rmargs);

Unfortunately, I don't know how to get Nova to allocate a larger block of
memory for an object than its size. The RMARGS struct is defined in
bindings.rs:

pub struct GSP_ARGUMENTS_CACHED {
    pub messageQueueInitArguments: MESSAGE_QUEUE_INIT_ARGUMENTS,
    pub srInitArguments: GSP_SR_INIT_ARGUMENTS,
    pub gpuInstance: u32_,
    pub bDmemStack: u8_,
    pub __bindgen_padding_0: [u8; 7usize],
    pub profilerArgs: GSP_ARGUMENTS_CACHED__bindgen_ty_1,
}

which I believe is 72 bytes.
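
(If it helps pin that down, a compile-time assertion against the bindgen
type would catch a mismatch; this sketch just assumes the 72-byte figure
is right:)

// Fails to build if the bindgen-generated struct is not 72 bytes.
const _: () = assert!(core::mem::size_of::<GSP_ARGUMENTS_CACHED>() == 72);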
It is allocated here:

let rmargs = CoherentAllocation::<GspArgumentsCached>::alloc_coherent(
    dev,
    1,
    GFP_KERNEL | __GFP_ZERO,
)?;

The "1" is a count field. Unfortunately, 4096/72 is not an even number, so I
can't just change
`count` to another number.

If you can tell me how to fix the CoherentAllocation allocation call so
that it allocates 4K, that would be a better fix for this problem. It's
only necessary for Turing/GA100, however.
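
One direction I haven't tried (just a sketch; the wrapper type below is
invented, not something in the current bindings, and it assumes the wrapper
can satisfy whatever bounds CoherentAllocation places on its element type):
wrap the struct in a 4K-aligned type. Rust rounds a type's size up to its
alignment, so a `count` of 1 would then allocate exactly 0x1000 bytes:

// Hypothetical wrapper around the bindgen struct; not in bindings.rs.
// Because Rust pads a type's size up to its alignment,
// size_of::<GspArgumentsPage>() is 0x1000 even though the inner struct
// is only ~72 bytes.
#[repr(C, align(4096))]
pub struct GspArgumentsPage {
    pub args: GspArgumentsCached,
}

// The allocation would stay the same apart from the element type:
let rmargs = CoherentAllocation::<GspArgumentsPage>::alloc_coherent(
    dev,
    1,
    GFP_KERNEL | __GFP_ZERO,
)?;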