On Tue, Aug 22, 2023 at 6:55 PM Faith Ekstrand <fa...@gfxstrand.net> wrote:
> On Tue, Aug 22, 2023 at 4:51 AM Christian König <christian.koe...@amd.com> wrote:
>
>> On 21.08.23 at 21:46, Faith Ekstrand wrote:
>>
>> On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koe...@amd.com> wrote:
>>
>>> [SNIP]
>>> So as long as nobody from userspace comes and says we absolutely need to optimize this use case, I would rather not do it.
>>
>> This is a place where nouveau's needs are legitimately different from AMD or Intel, I think. NVIDIA's command streamer model is very different from AMD and Intel. On AMD and Intel, each EXEC turns into a single small packet (on the order of 16B) which kicks off a command buffer. There may be a bit of cache management or something around it, but that's it. From there, it's userspace's job to make one command buffer chain to another until it's finally done and then do a "return", whatever that looks like.
>>
>> NVIDIA's model is much more static. Each packet in the HW/FW ring is an address and a size; that much data is processed, then it grabs the next packet and processes it. The result is that, if we use multiple buffers of commands, there's no way to chain them together. We just have to pass the whole list of buffers to the kernel.
>>
>> So far that is actually completely identical to what AMD has.
>>
>> A single EXEC ioctl / job may have 500 such addr+size packets depending on how big the command buffer is.
>>
>> And that is what I don't understand. Why would you need hundreds of such addr+size packets?
>
> Well, we're not really in control of it. We can control our base pushbuf size and that's something we can tune, but we're still limited by the client. We have to submit another pushbuf whenever:
>
> 1. We run out of space (power-of-two growth is also possible, but the size is limited to a maximum of about 4MiB due to hardware limitations.)
> 2. The client calls a secondary command buffer.
> 3. Any usage of indirect draw or dispatch on pre-Turing hardware.
>
> At some point we need to tune our BO size a bit to avoid (1) while also avoiding piles of tiny BOs. However, (2) and (3) are out of our control.
>
>> This is basically identical to what AMD has (well, on newer hw there is an extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't widely used as far as I know).
>
> According to Bas, RADV chains on recent hardware.

well:

1) on GFX6 and older we can't chain at all
2) on Compute/DMA we can't chain at all
3) with VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT we can't chain between cmdbuffers
4) for some secondary use cases we can't chain.

so we have to do the "submit multiple" dance in many cases.

>> Previously the limit was something like 4, which we extended because Bas came up with similar requirements for the AMD side from RADV.
>>
>> But essentially those approaches with hundreds of IBs don't sound like a good idea to me.
>
> No one's arguing that they like it. Again, the hardware isn't designed to have a kernel in the way. It's designed to be fed by userspace. But we're going to have the kernel in the middle for a while, so we need to make it not suck too bad.
>
> ~Faith
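To make the shape of the problem concrete: a submission in this model is essentially an array of address+size entries handed to the kernel in a single EXEC ioctl. A minimal sketch in C with made-up struct and field names (deliberately not the actual nouveau uAPI or any other real interface):

#include <stdint.h>

/* One pushbuf segment: the HW/FW ring packet is basically this pair.
 * Illustrative only; names are invented for the example. */
struct push_entry {
	uint64_t addr;   /* GPU VA of the pushbuf segment */
	uint32_t size;   /* length of the segment in bytes */
	uint32_t pad;
};

/* One EXEC submission: the whole list of segments goes in at once. */
struct exec_submit {
	uint32_t channel;
	uint32_t entry_count;               /* may be in the hundreds */
	const struct push_entry *entries;
	uint32_t in_fence;                  /* illustrative sync handles */
	uint32_t out_fence;
};

With hundreds of such entries per EXEC, whatever sits between userspace and the ring has to cope with jobs of wildly different sizes, which is exactly the scheduler question discussed below.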
>> It gets worse on pre-Turing hardware where we have to split the batch for every single DrawIndirect or DispatchIndirect.
>>
>> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable model if you assume that userspace is feeding the firmware. When that's happening, you just have a userspace thread that sits there and feeds the ringbuffer with whatever is next, and you can marshal as much data through as you want. Sure, it'd be nice to have a 2nd-level batch thing that gets launched from the FW ring and has all the individual launch commands, but it's not at all necessary.
>>
>> What does that mean from a gpu_scheduler PoV? Basically, it means a variable packet size.
>>
>> What does this mean for implementation? IDK. One option would be to teach the scheduler about actual job sizes. Another would be to virtualize it and have another layer underneath the scheduler that does the actual feeding of the ring. Another would be to decrease the job size somewhat and then have the front-end submit as many jobs as it needs to service userspace and only put the out-fences on the last job. All the options kinda suck.
>>
>> Yeah, agree. The job size Danilo suggested is still the least painful.
>>
>> Christian.
>>
>> ~Faith
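For what it's worth, the last option above (decrease the job size and have the front-end submit as many fixed-size jobs as needed, with the out-fence only on the last one) could look roughly like the sketch below. All names (push_entry, MAX_ENTRIES_PER_JOB, submit_job) are invented for illustration, and the real split would live in the kernel on top of the scheduler, not in userspace C like this:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct push_entry {
	uint64_t addr;
	uint32_t size;
};

#define MAX_ENTRIES_PER_JOB 64  /* whatever fixed job size the scheduler is taught */

/* Stand-in for handing one bounded-size job to the scheduler. */
static void submit_job(const struct push_entry *entries, uint32_t count,
		       bool attach_out_fence)
{
	printf("job: %u entries, out-fence: %s\n",
	       (unsigned)count, attach_out_fence ? "yes" : "no");
}

/* Split one userspace EXEC into N scheduler jobs; only the last job
 * carries the out-fence that userspace actually waits on. */
static void submit_exec(const struct push_entry *entries, uint32_t total)
{
	for (uint32_t off = 0; off < total; off += MAX_ENTRIES_PER_JOB) {
		uint32_t count = total - off;

		if (count > MAX_ENTRIES_PER_JOB)
			count = MAX_ENTRIES_PER_JOB;
		submit_job(entries + off, count, off + count == total);
	}
}

int main(void)
{
	static struct push_entry entries[500];

	submit_exec(entries, 500);
	return 0;
}

The appeal of this shape is that userspace still sees a single submission complete (one out-fence) while the scheduler only ever sees jobs of bounded size; the cost is the extra per-job overhead that nobody is excited about.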