On Thu, 2009-01-22 at 22:34 +0100, Thomas Hellström wrote:
> Jerome Glisse wrote:
> > On Thu, 2009-01-22 at 20:54 +0100, Thomas Hellström wrote:
> >
> >> Jerome Glisse wrote:
> >>
> >>> Hi Thomas,
> >>>
> >>> I went through the new ttm and I would like to get your feedback on changes I think are worthwhile. The current design forces the driver to use the ttm memory placement and movement logic. I would like to leave that up to the driver and factor out the parts which do the low-level work. Basically it ends up as splitting:
> >>>
> >>> -mmap buffer to userspace
> >>> -kmap buffer
> >>> -allocate pages
> >>> -set up PAT or MTRR
> >>> -invalidate mapping
> >>> -other things I might just be forgetting right now :)
> >>>
> >>> from:
> >>>
> >>> -buffer movement
> >>> -buffer validation
> >>> -buffer eviction
> >>> -...
> >>>
> >>> The motivation behind this is that the current design will likely often end up stalling the GPU each time we do buffer validation. This doesn't sound like a good plan.
> >>>
> >>> Here is what would happen on radeon if I correctly understand the current ttm movement code (e.g. the eviction path when we move a bo from system to vram, or vram to system): the bo movement will be scheduled in the command stream and the CPU will wait until the GPU is done moving; once that happens, it likely means there is no more scheduled work in the GPU pipeline.
> >>>
> >>> For radeon the idea I had was to use 2 lists:
> >>> -GART (could be AGP, PCIE, PCI) bind list
> >>> -GART (same) unbind list
> >>> The GPU will wait on the bind list and the CPU will wait on the unbind list. So when we do validation we find where all the BOs of the CS will be placed (could depend on hw and userspace hints) and we set up accordingly (validation of a command stream buffer produces the following; some of it could be empty if there is enough room):
> >>> -GART unbind (to make room in gart if necessary)
> >>> -GART bind (to bind the system memory where we will save BOs evicted from vram)
> >>> -gpu cmd to dma the evicted bos into GART
> >>> -GART unbind (to make room in gart if necessary; could be merged with the first unbind)
> >>> -GART bind (BOs referenced in the CS which are moved into VRAM)
> >>> -gpu cmd to dma into VRAM
> >>> -GART unbind (if necessary; again could be merged with the previous unbind)
> >>> -GART bind (remaining BOs, if any, which stay in the gart for the cs)
> >>> -cs
> >>>
> >>> That was the worst-case scenario where we use the whole GART area. It looks complex, but the idea is that most of the unbinding and binding can be scheduled before the GPU even has to wait for it. This scheme also allows adding, in the future, some kind of GPU cs scheduler which can try to minimize BO movement by rescheduling cs. Anyway, the assumption here is that most of the time we will have this situation:
> >>> -bind BOs referenced by cs1
> >>> -GPU dma
> >>> -signal to CPU to unbind DMAed bos (GPU keeps working)
> >>> -cs1
> >>> -bind BOs referenced by cs2 (GPU is still working on cs1 but we have enough room)
> >>> -signal to CPU that cs1 is done and all referenced BOs can be unbound
> >>> -GPU dma
> >>> -cs2
> >>> ...
> >>>
> >>> The GPU is always busy and the CPU is woken up through an irq to bind/unbind, to minimize the chance of the GPU having to wait on the CPU binding/unbinding activities.
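[A minimal C sketch of the two-list bookkeeping described above, with the unbind list drained from an irq-triggered work item so the CPU never stalls the GPU. Every name here (gart_request, radeon_gart_queues, radeon_gart_unbind_range) is illustrative, not existing TTM or radeon API.]

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct ttm_buffer_object;

/* One pending bind or unbind of a bo's pages into the GART aperture. */
struct gart_request {
	struct list_head head;
	struct ttm_buffer_object *bo;
	unsigned long gart_offset;
};

struct radeon_gart_queues {
	spinlock_t lock;
	struct list_head bind_list;	/* consumed before the GPU needs the pages */
	struct list_head unbind_list;	/* drained by the CPU once the fence irq fires */
	struct work_struct unbind_work;
};

/* Hypothetical helper that actually clears the GART entries. */
void radeon_gart_unbind_range(unsigned long gart_offset,
			      struct ttm_buffer_object *bo);

/* Scheduled from the fence irq once the cs that used the entries is done. */
static void radeon_gart_unbind_work(struct work_struct *work)
{
	struct radeon_gart_queues *q =
		container_of(work, struct radeon_gart_queues, unbind_work);
	struct gart_request *req, *tmp;
	unsigned long flags;
	LIST_HEAD(done);

	/* Grab the whole unbind list in one go, then work outside the lock. */
	spin_lock_irqsave(&q->lock, flags);
	list_splice_init(&q->unbind_list, &done);
	spin_unlock_irqrestore(&q->lock, flags);

	list_for_each_entry_safe(req, tmp, &done, head) {
		list_del(&req->head);
		radeon_gart_unbind_range(req->gart_offset, req->bo);
		kfree(req);
	}
}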
> >>>
> >>> Also, on some newer GPUs and PCIE, if there is no iommu (or it can be ignored), the GPU can do the binding/unbinding on its own and so never waits on the CPU. I need to check what the iommu design requires.
> >>>
> >>> So I would like to isolate the ttm functions which provide core utilities (mapping, swapping, allocating pages, setting cache, ...) from the logic which drives BO movement/validation. Do you think that's a good plan? Do you have any requirements or wishes on how I should do that?
> >>>
> >>> Note that I don't intend to implement the scheme I just described in the first shot, but rather to stick with the existing BO movement/placement/validation scheme as a first step. Then I can slowly try implementing this scheme, or any better scheme one might come up with :)
> >>>
> >>> Cheers,
> >>> Jerome Glisse
> >>>
> >> Hi, Jerome!
> >>
> >> So, if I understand you correctly, you want to make a smaller bo class on which the low-level functions operate (basically just something similar to a struct ttm_tt), so that one can derive from it and implement a more elaborate migration scheme?
> >>
> >> I think that is a good idea, and in that case one could perhaps keep the current implementation as a special case.
> >>
> >> What I'm a little worried about, though, is that a lot of new ideas will come up and postpone an upstream move again, but at this point I don't see such a splitting doing that, since it will only be kernel internals.
> >>
> >> But, before you start looking at this, remember that the current implementation is using unnecessary synchronization when moving. There is, for example, no need to synchronize if the GPU is used to DMA from system to VRAM. That's basically because I've been lazy, and it's a bit hard to debug when buffer data is in flight, and even worse if a pipelined buffer move fails :(.
> >>
> >> Anyway, it's entirely possible to remove the synchronization from the current buffer move logic and let the driver handle it. The only synchronization points would be bo_move_memcpy and bo_move_tt. The driver could then, in its own move functions, do whatever is necessary to sync.
> >>
> >> /Thomas
> >>
> > I don't want to postpone upstream acceptance of this code either; that is why I asked first. I will try to come up with a patch. The ttm_tt struct is exactly what I had in mind; I haven't looked deeply, but I think I mostly need to isolate a few things so that most of the functions dealing with ttm_tt don't require too much from other parts of ttm.
> >
> > Also, I think there are a few things that kernel people might pester us about (like using int when the data is a boolean); I will also try to go through the code with a kernel reviewer's mindset ;)
> >
> > I will push to my own repo once I have something which at least compiles, so you can review & comment.
> >
> > Cheers,
> > Jerome
> >
> Sounds good.
>
> I've attached a small patch that will make it possible to accomplish full pipelining of buffer moves with the current code, by moving all the synchronization to the driver->move() hook. Also note that it's possible to attach a signaled() callback to a fence object so that you can have it do the asynchronous unbind when the accelerated move to TT is done.
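[A sketch of the signaled()-callback idea: once the accelerated move into TT has signaled, the callback queues the old GART range for unbinding instead of the CPU waiting on the move. The callback node and its use are assumptions, not the fence interface from the attached patch; gart_request and radeon_gart_queues are the illustrative types from the sketch above.]

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* Hypothetical callback node a driver could hang on one of its fences. */
struct radeon_fence_cb {
	struct list_head head;
	void (*func)(struct radeon_fence_cb *cb);
};

struct radeon_move_cleanup {
	struct radeon_fence_cb cb;
	struct radeon_gart_queues *q;	/* from the earlier sketch */
	struct gart_request *old_range;	/* placement to release once signaled */
};

/* Runs from the fence irq path: the unbind is queued, the CPU never blocks. */
static void radeon_move_signaled(struct radeon_fence_cb *cb)
{
	struct radeon_move_cleanup *c =
		container_of(cb, struct radeon_move_cleanup, cb);
	unsigned long flags;

	spin_lock_irqsave(&c->q->lock, flags);
	list_add_tail(&c->old_range->head, &c->q->unbind_list);
	spin_unlock_irqrestore(&c->q->lock, flags);

	schedule_work(&c->q->unbind_work);
}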
>
> What the driver->move() will have to do is schedule all moves on the command stream and then return immediately after registering a handler that will do any cleanup. Then, when it's time to move in the buffer being validated, check whether any synchronization is needed (for example if the move is done on a different engine), and then go ahead and schedule the validation move.
>
> There may be reasons to do the splitting anyway, but I just wanted you to be aware of this way of doing things. I haven't tested the patch with our drivers yet.
>
> Regards,
> Thomas
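[For reference, a sketch of what such a pipelined driver->move() could look like: emit the blit on the ring, hang the cleanup on the move fence, and return without waiting, so that later validation only syncs when it really has to. The hook signature is approximated from the mainline ttm_bo_driver, header paths may differ in the pre-merge tree, and every radeon_* helper below is a placeholder, not code from the attached patch.]

#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_bo_driver.h>

struct radeon_fence;	/* placeholder driver fence type */

static int radeon_bo_move(struct ttm_buffer_object *bo, bool evict,
			  bool interruptible, bool no_wait,
			  struct ttm_mem_reg *new_mem)
{
	struct radeon_fence *fence;
	int r;

	/* Queue the copy on the command stream; do not wait for it here. */
	r = radeon_copy_blit(bo, &bo->mem, new_mem, &fence);	/* placeholder */
	if (r)
		return r;

	/* Unbinding of the old placement happens when this fence signals. */
	radeon_move_register_cleanup(bo, fence);		/* placeholder */

	/* Keep the fence on the bo so validation can decide later whether
	 * it needs to wait at all. */
	return radeon_bo_set_move_fence(bo, fence);		/* placeholder */
}

/* At validation time: wait only if the engine about to consume the buffer
 * is not the one that performed the move. */
static int radeon_bo_sync_for_cs(struct ttm_buffer_object *bo, int cs_engine)
{
	struct radeon_fence *fence = radeon_bo_get_move_fence(bo); /* placeholder */

	if (fence && radeon_fence_engine(fence) != cs_engine)	/* placeholder */
		return radeon_fence_wait(fence, true);		/* placeholder */
	return 0;
}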
Yup, I hadn't thought of that solution; it definitely sounds doable. I will ponder the code a bit more :)

Cheers,
Jerome