On Tue, Aug 22, 2023 at 6:55 PM Faith Ekstrand <fa...@gfxstrand.net> wrote:

> On Tue, Aug 22, 2023 at 4:51 AM Christian König <christian.koe...@amd.com>
> wrote:
>
>> Am 21.08.23 um 21:46 schrieb Faith Ekstrand:
>>
>> On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koe...@amd.com>
>> wrote:
>>
>>> [SNIP]
>>> So as long as nobody from userspace comes and says we absolutely need to
>>> optimize this use case I would rather not do it.
>>>
>>
>> This is a place where nouveau's needs are legitimately different from AMD
>> or Intel, I think.  NVIDIA's command streamer model is very different from
>> AMD and Intel.  On AMD and Intel, each EXEC turns into a single small
>> packet (on the order of 16B) which kicks off a command buffer.  There may
>> be a bit of cache management or something around it but that's it.  From
>> there, it's userspace's job to make one command buffer chain to another
>> until it's finally done and then do a "return", whatever that looks like.
>>
>> NVIDIA's model is much more static.  Each packet in the HW/FW ring is an
>> address and a size; that much data is processed, then it grabs the next
>> packet and processes that. The result is that, if we use multiple buffers
>> of commands, there's no way to chain them together.  We just have to pass
>> the whole list of buffers to the kernel.
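
To make that concrete, a ring entry boils down to something like this
(names and field widths invented for illustration, not the real GPFIFO
layout):

#include <stdint.h>

/* Illustration only: each ring entry just says "run <size> bytes of
 * commands starting at <addr>"; there is no jump/call from inside one
 * buffer into the next, so every command buffer needs its own entry.
 */
struct example_ring_entry {
        uint64_t addr;  /* GPU VA of the command buffer */
        uint32_t size;  /* amount of command data to process */
};

An EXEC with N command buffers therefore needs N such entries, which is
where the "500 addr+size packets per job" numbers below come from.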
>>
>>
>> So far that is actually completely identical to what AMD has.
>>
>> A single EXEC ioctl / job may have 500 such addr+size packets depending
>> on how big the command buffer is.
>>
>>
>> And that is what I don't understand. Why would you need hundreds of such
>> addr+size packets?
>>
>
> Well, we're not really in control of it.  We can control our base pushbuf
> size, and that's something we can tune, but we're still limited by the
> client.  We have to submit another pushbuf whenever:
>
>  1. We run out of space (power-of-two growth is also possible, but the
> size is limited to a maximum of about 4MiB due to hardware limitations).
>  2. The client calls a secondary command buffer.
>  3. Any usage of indirect draw or dispatch on pre-Turing hardware.
>
> At some point we need to tune our BO size a bit to avoid (1) while also
> avoiding piles of tiny BOs.  However, (2) and (3) are out of our control.
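
Just to restate those three cases in pseudo-C (invented names, nothing
taken from NVK):

#include <stdbool.h>
#include <stdint.h>

/* Illustration only: (1) is tunable via how we size and grow the BO, up to
 * the ~4MiB hardware cap mentioned above; (2) and (3) are forced on us by
 * the client and the hardware generation.
 */
static bool must_start_new_pushbuf(uint32_t bytes_needed, uint32_t bytes_left,
                                   bool calls_secondary_cmdbuf,
                                   bool indirect_on_pre_turing)
{
        return bytes_needed > bytes_left || /* (1) out of space */
               calls_secondary_cmdbuf ||    /* (2) vkCmdExecuteCommands */
               indirect_on_pre_turing;      /* (3) indirect draw/dispatch */
}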
>
> This is basically identical to what AMD has (well, on newer hw there is an
>> extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't
>> widely used as far as I know).
>>
>
> According to Bas, RADV chains on recent hardware.
>

Well:

1) on GFX6 and older we can't chain at all
2) on the compute/DMA queues we can't chain at all
3) with VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT we can't chain between
command buffers
4) for some secondary command buffer use cases we can't chain

So we have to do the "submit multiple" dance in many cases.
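
Roughly, the per-cs decision at submit time looks like the sketch below
(names invented for illustration, not actual RADV code). When we can chain,
we patch a jump/IB packet into the tail of the previous command stream;
when we can't, the next one has to become another IB in the submission,
which is where the large IB counts come from:

#include <stdbool.h>

struct example_cs {
        bool gfx_queue;          /* (2) no chaining on compute/DMA queues */
        bool gfx7_or_newer;      /* (1) no chaining on GFX6 and older */
        bool simultaneous_use;   /* (3) SIMULTANEOUS_USE_BIT is set */
        bool secondary_no_chain; /* (4) problematic secondary cmdbuffer case */
};

static bool can_chain(const struct example_cs *prev,
                      const struct example_cs *next)
{
        return prev->gfx_queue && next->gfx_queue &&
               prev->gfx7_or_newer &&
               !prev->simultaneous_use && !next->simultaneous_use &&
               !next->secondary_no_chain;
}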

>
>
>> Previously the limit was something like 4, which we extended because Bas
>> came up with similar requirements for the AMD side from RADV.
>>
>> But essentially those approaches with hundreds of IBs don't sound like a
>> good idea to me.
>>
>
> No one's arguing that they like it.  Again, the hardware isn't designed to
> have a kernel in the way. It's designed to be fed by userspace. But we're
> going to have the kernel in the middle for a while so we need to make it
> not suck too bad.
>
> ~Faith
>
> It gets worse on pre-Turing hardware where we have to split the batch for
>> every single DrawIndirect or DispatchIndirect.
>>
>> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable
>> model if you assume that userspace is feeding the firmware.  When that's
>> happening, you just have a userspace thread that sits there and feeds the
>> ringbuffer with whatever is next and you can marshal as much data through
>> as you want. Sure, it'd be nice to have a 2nd level batch thing that gets
>> launched from the FW ring and has all the individual launch commands but
>> it's not at all necessary.
>>
>> What does that mean from a gpu_scheduler PoV? Basically, it means a
>> variable packet size.
>>
>> What does this mean for implementation? IDK.  One option would be to
>> teach the scheduler about actual job sizes. Another would be to virtualize
>> it and have another layer underneath the scheduler that does the actual
>> feeding of the ring. Another would be to decrease the job size somewhat and
>> then have the front-end submit as many jobs as it needs to service
>> userspace and only put the out-fences on the last job. All the options
>> kinda suck.
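
For the first of those options (teaching the scheduler about actual job
sizes) I'd imagine something along these lines (hypothetical structs and
names, nothing that exists in gpu_scheduler today):

#include <linux/types.h>

/* Hypothetical sketch: give the ring a capacity in some unit (entries,
 * dwords, ...), have every job declare how many units it needs, and only
 * push a job to the HW/FW ring once the in-flight jobs leave enough room
 * for it.
 */
struct example_sched_ring {
        u32 capacity;   /* total units the HW/FW ring can hold */
        u32 in_flight;  /* units consumed by jobs that haven't signalled yet */
};

struct example_sched_job {
        u32 units;      /* e.g. number of addr+size entries in this EXEC */
};

static bool example_job_fits(const struct example_sched_ring *ring,
                             const struct example_sched_job *job)
{
        return ring->in_flight + job->units <= ring->capacity;
}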
>>
>>
>> Yeah, agree. The job size Danilo suggested is still the least painful.
>>
>> Christian.
>>
>>
>> ~Faith
>>
>>
>>
