On Thu, 2009-01-22 at 22:34 +0100, Thomas Hellström wrote:
> Jerome Glisse wrote:
> > On Thu, 2009-01-22 at 20:54 +0100, Thomas Hellström wrote:
> >   
> >> Jerome Glisse wrote:
> >>     
> >>> Hi Thomas,
> >>>
> >>> I went through the new TTM and I would like to get your feedback on
> >>> changes I think are worthwhile. The current design forces the driver
> >>> to use the TTM memory placement and movement logic. I would like to
> >>> leave that up to the driver and factor out the part which does the
> >>> low-level work; basically it ends up splitting:
> >>>
> >>> -mmap buffer to userspace
> >>> -kmap buffer
> >>> -allocate page
> >>> -set up PAT or MTRR
> >>> -invalidate mapping
> >>> -Other things I might just be forgetting right now :)
> >>>
> >>> From:
> >>> -buffer movement
> >>> -buffer validation
> >>> -buffer eviction
> >>> -...
> >>>
> >>> The motivation behind this is that the current design will likely
> >>> often end up stalling the GPU each time we do buffer validation. This
> >>> doesn't sound like a good plan.
> >>>
> >>> Here is what would happen on radeon, if I correctly understand the
> >>> current TTM movement code (e.g. the eviction path when we move a BO
> >>> from system to VRAM, or from VRAM to system). The BO movement will be
> >>> scheduled in the command stream and the CPU will wait until the GPU is
> >>> done moving; once this happens, it likely means there is no more
> >>> scheduled work left in the GPU pipeline.
> >>>
> >>> For radeon the idea I had was to use 2 lists:
> >>> -GART (could be AGP, PCIE, PCI) bind list
> >>> -GART (same) unbind list
> >>> The GPU will wait on the bind list and the CPU will wait on the unbind
> >>> list. So when we do validation we find where all the BOs of the CS
> >>> should be placed (this could depend on hw and userspace hints) and we
> >>> set things up accordingly (validation of a command stream buffer
> >>> produces the following; some of these steps could be empty if there is
> >>> enough room - see the sketch after the list):
> >>> -GART unbind (to make room in the GART if necessary)
> >>> -GART bind (to bind the system memory where we will save the BOs
> >>>  evicted from VRAM)
> >>> -GPU cmd to DMA the evicted BOs into the GART
> >>> -GART unbind (to make room in the GART if necessary; could be
> >>>  merged with the first unbind)
> >>> -GART bind (BOs referenced in the CS which are moved into VRAM)
> >>> -GPU cmd to DMA into VRAM
> >>> -GART unbind (if necessary; again, could be merged with the previous
> >>>  unbind)
> >>> -GART bind (remaining BOs, if any, which stay in the GART for the CS)
> >>> -cs
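> >>>
> >>> (A very rough sketch of how such a validation pass could queue the
> >>> bind/unbind work; every name below is made up for illustration, this
> >>> is not existing TTM or radeon code:)
> >>>
> >>> #include <linux/list.h>
> >>> #include <linux/slab.h>
> >>> #include <linux/types.h>
> >>>
> >>> /* Hypothetical driver BO; only the field this sketch needs. */
> >>> struct my_bo {
> >>> 	bool needs_gart_bind;	/* placement decision from hw/user hints */
> >>> };
> >>>
> >>> /* One pending GART bind or unbind. */
> >>> struct gart_work {
> >>> 	struct list_head head;
> >>> 	struct my_bo *bo;
> >>> };
> >>>
> >>> /* Work produced by validating one command stream: the GPU waits
> >>>  * until the bind list has been processed by the CPU, and the CPU is
> >>>  * told (via irq) when the unbind list can be processed. */
> >>> struct cs_validation {
> >>> 	struct list_head gart_bind;
> >>> 	struct list_head gart_unbind;
> >>> 	/* plus the GPU DMA commands for the evicted/promoted BOs */
> >>> };
> >>>
> >>> static int cs_queue_gart_work(struct cs_validation *v, struct my_bo *bo)
> >>> {
> >>> 	struct gart_work *w = kmalloc(sizeof(*w), GFP_KERNEL);
> >>>
> >>> 	if (!w)
> >>> 		return -ENOMEM;
> >>> 	w->bo = bo;
> >>> 	list_add_tail(&w->head, bo->needs_gart_bind ?
> >>> 		      &v->gart_bind : &v->gart_unbind);
> >>> 	return 0;
> >>> }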
> >>>
> >>> That was the worst-case scenario where we use all of the GART area. It
> >>> looks complex, but the idea is that most of the unbinding and binding
> >>> could be scheduled before the GPU even has to wait for it. This scheme
> >>> would also allow adding, in the future, some kind of GPU CS scheduler
> >>> which could try to minimize BO movement by rescheduling CSes. Anyway,
> >>> the assumption here is that most of the time we will have this
> >>> situation:
> >>> -bind BO referenced by cs1
> >>> -GPU dma
> >>> -signal to CPU to unbind the DMAed BOs (GPU keeps working)
> >>> -cs1
> >>> -bind BO referenced by cs2 (GPU is still working on cs1
> >>>  but we have enough room)
> >>> -signal to CPU that cs1 is done and all referenced BOs can
> >>>  be unbound
> >>> -GPU dma
> >>> -cs2
> >>> ...
> >>>
> >>> The GPU is always busy and the CPU is woken up through an irq to
> >>> bind/unbind, to minimize the chance of the GPU having to wait on the
> >>> CPU's binding/unbinding activities.
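> >>>
> >>> (Purely illustrative and again with made-up names, reusing the
> >>> hypothetical gart_work above: the GPU irq handler only kicks a work
> >>> item and the actual unbinding runs in process context, so the GPU
> >>> rarely has to wait for it. Locking is omitted; my_device,
> >>> my_gpu_status(), MY_DMA_DONE and my_gart_unbind_bo() are assumed
> >>> driver-specific pieces.)
> >>>
> >>> #include <linux/interrupt.h>
> >>> #include <linux/workqueue.h>
> >>>
> >>> static irqreturn_t my_gpu_irq(int irq, void *arg)
> >>> {
> >>> 	struct my_device *dev = arg;
> >>>
> >>> 	/* A DMA batch finished, its GART pages can now be unbound. */
> >>> 	if (my_gpu_status(dev) & MY_DMA_DONE)
> >>> 		schedule_work(&dev->gart_unbind_work);
> >>> 	return IRQ_HANDLED;
> >>> }
> >>>
> >>> static void my_gart_unbind_work(struct work_struct *work)
> >>> {
> >>> 	struct my_device *dev =
> >>> 		container_of(work, struct my_device, gart_unbind_work);
> >>> 	struct gart_work *w, *tmp;
> >>>
> >>> 	list_for_each_entry_safe(w, tmp, &dev->gart_unbind, head) {
> >>> 		my_gart_unbind_bo(dev, w->bo);	/* clear the GART entries */
> >>> 		list_del(&w->head);
> >>> 		kfree(w);
> >>> 	}
> >>> }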
> >>>
> >>> Also, on some newer GPUs on PCIE, if there is no iommu (or it can be
> >>> ignored) the GPU can do the binding/unbinding on its own and so never
> >>> wait on the CPU. I need to check what the iommu design requires.
> >>>
> >>>
> >>> So I would like to isolate the TTM functions which provide core
> >>> utilities (mapping, swapping, allocating pages, setting caching, ...)
> >>> from the logic which drives BO movement/validation. Do you think it's
> >>> a good plan? Do you have any requirements or wishes on the way I could
> >>> do that?
> >>>
> >>> Note that I don't intend to implement the scheme I just described in
> >>> the first shot, but rather to stick with the bounded BO movement/
> >>> placement/validation scheme as a first step. Then I can slowly try
> >>> implementing this scheme, or any better scheme one might come up
> >>> with :)
> >>>
> >>> Cheers,
> >>> Jerome Glisse
> >>>
> >>>   
> >>>       
> >> Hi, Jerome!
> >>
> >> So, if I understand you correctly, you want to make a smaller bo class
> >> on which the low-level functions operate (basically just something
> >> similar to a struct ttm_tt), so that one can derive from that and
> >> implement a more elaborate migration scheme?
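> >>
> >> (Just to illustrate what I mean by deriving; the names below are made
> >> up for the example, not proposed API:)
> >>
> >> #include <linux/kernel.h>
> >> #include <linux/list.h>
> >> #include <linux/types.h>
> >>
> >> /* The small low-level object: pages, caching state, map handles.
> >>  * Roughly the role struct ttm_tt plays today. */
> >> struct ttm_lowlevel_obj {
> >> 	struct page **pages;
> >> 	unsigned long num_pages;
> >> 	/* caching state, kmap/mmap bookkeeping, ... */
> >> };
> >>
> >> /* A driver buffer object derives from it by embedding it and adds its
> >>  * own placement/migration state on top. */
> >> struct my_driver_bo {
> >> 	struct ttm_lowlevel_obj ttm;	/* low-level part TTM operates on */
> >> 	uint32_t placement;		/* VRAM / GART / system */
> >> 	struct list_head gart_head;	/* driver migration bookkeeping */
> >> };
> >>
> >> static inline struct my_driver_bo *to_my_bo(struct ttm_lowlevel_obj *obj)
> >> {
> >> 	return container_of(obj, struct my_driver_bo, ttm);
> >> }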
> >>
> >> I think that is a good idea, and in that case one could perhaps keep the 
> >> current implementation as a special case.
> >>
> >> What I'm a little worried about, though, is that a lot of new ideas
> >> will come up and postpone an upstream move again, but at this point I
> >> don't see such a split doing that, since it will only be kernel
> >> internals.
> >>
> >> But, before you start looking at this, remember that the current 
> >> implementation is using unnecessary synchronization when moving. There 
> >> is, for example, no need to synchronize if the GPU is used to DMA from 
> >> system to VRAM. That's basically because I've been lazy, and it's a bit 
> >> hard to debug when buffer data is in flight, and even worse if a 
> >> pipelined buffer move fails :(.
> >>
> >> Anyway, it's entirely possible to remove the synchronization from the
> >> current buffer move logic and let the driver handle it. The only
> >> synchronization points would be bo_move_memcpy and bo_move_tt.
> >> The driver could then, in its own move functions, do what's necessary to 
> >> sync.
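> >>
> >> (For example, something like this invented move hook - the accelerated
> >> path just queues the blit and returns, only the CPU-copy fallback
> >> waits; none of these names are real TTM functions:)
> >>
> >> static int my_bo_move(struct my_driver_bo *bo, bool evict, bool no_wait,
> >> 		      struct my_mem_reg *new_mem)
> >> {
> >> 	int ret;
> >>
> >> 	if (my_blit_engine_usable(bo)) {
> >> 		/* Schedule the copy on the GPU and return without waiting;
> >> 		 * any needed synchronization is done later by the driver. */
> >> 		return my_bo_move_blit(bo, new_mem);
> >> 	}
> >> 	/* Fallback: CPU copy, which has to wait for the GPU to be idle. */
> >> 	ret = my_bo_wait_idle(bo, no_wait);
> >> 	if (ret)
> >> 		return ret;
> >> 	return my_bo_move_memcpy(bo, new_mem);
> >> }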
> >>
> >> /Thomas
> >>     
> >
> > I don't want to postpone upstream acceptance of this code either; this
> > is why I asked first. I will try to come up with a patch, but the
> > ttm_tt struct is exactly what I had in mind. I haven't looked deeply,
> > but I think I mostly need to isolate a few things so that most of the
> > functions dealing with ttm_tt don't require too much from the other
> > parts of TTM.
> >
> > Also, I think there are a few things that kernel people might pester us
> > about (like using int when the data is a boolean); I will also try to
> > go through the code with a kernel reviewer's mindset ;)
> >
> > I will push to my own repo once I have something which at least
> > compiles, so you can review & comment.
> >
> > Cheers,
> > Jerome
> >
> >   
> Sounds good.
> 
> I've attached a small patch that will make it possible to accomplish
> full pipelining of buffer moves with the current code, by moving all
> the synchronization to the driver->move() hook. Also note that it's
> possible to attach a signaled() callback to a fence object so that you
> can have that do the asynchronous unbind when the accelerated move to
> TT is done.
> 
> What the driver->move() will have to do is schedule all moves on the
> command stream and then return immediately after registering a handler
> that will do any cleanup. Then, when it's time to move in the buffer to
> be validated, check whether any synchronization is needed (for example
> if the move was done on a different engine), and then go ahead and
> schedule the validation move.
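> 
> (Very roughly, and with made-up names since the exact hooks may end up
> looking different - the point is just that move() queues the copy,
> registers a cleanup callback on the fence, and returns without waiting:)
> 
> /* Runs when the fence for the accelerated move signals: do the
>  * asynchronous cleanup, e.g. unbind the pages of the old placement. */
> static void my_move_done(struct my_fence *fence, void *data)
> {
> 	struct my_driver_bo *bo = data;
> 
> 	my_unbind_old_placement(bo);
> }
> 
> static int my_driver_move(struct my_driver_bo *bo,
> 			  struct my_mem_reg *new_mem)
> {
> 	struct my_fence *fence;
> 	int ret;
> 
> 	ret = my_bo_blit_fenced(bo, new_mem, &fence);	/* emit the GPU copy */
> 	if (ret)
> 		return ret;
> 	/* Register the cleanup handler and return immediately; validation
> 	 * later only waits if the pending move ran on a different engine. */
> 	my_fence_add_signaled_callback(fence, my_move_done, bo);
> 	return 0;
> }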
> 
> There may be reasons to do the splitting anyway, but I just wanted you 
> to be aware of this way of doing things. I haven't tested the patch with 
> our drivers yet.
> 
> Regards,
> Thomas

Yup, I hadn't thought of that solution; it definitely sounds doable. I
will ponder the code a bit more :)

Cheers,
Jerome

