> But because of driver differences I can't implement it as a straight
> wait queue. Some drivers may not have a reliable interrupt, so they
> need a custom wait function (qxl). Some may need to do extra flushing
> to get fences signaled (vmwgfx), others need some locking to protect
> against gpu lockup races (radeon, i915??). And nouveau doesn't use
> wait queues, but rolls its own.

But when all those drivers need a special wait function, how can you
still justify the common callback when a fence is signaled?
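For reference, the split Maarten describes would look roughly like this
against the 3.17-era include/linux/fence.h API: the callback list and
fence_add_callback() stay common to all drivers, and only the wait path
is overridden. This is a sketch, not code from the thread; every foo_*
name is hypothetical, and the polling ->wait() stands in for a driver
without a reliable completion interrupt.

    /* Sketch only: a driver fence whose ->wait() polls a hw-visible
     * seqno word instead of sleeping on an interrupt. */
    #include <linux/fence.h>
    #include <linux/io.h>
    #include <linux/sched.h>

    struct foo_fence {
        struct fence base;
        void __iomem *last_seq; /* hw-written: last signaled seqno */
    };

    static const char *foo_get_driver_name(struct fence *f)
    {
        return "foo";
    }

    static const char *foo_get_timeline_name(struct fence *f)
    {
        return "foo-ring0";
    }

    static bool foo_signaled(struct fence *f)
    {
        struct foo_fence *ff = container_of(f, struct foo_fence, base);

        /* One monotonic comparison (seqno wraparound ignored here). */
        return readl(ff->last_seq) >= f->seqno;
    }

    static bool foo_enable_signaling(struct fence *f)
    {
        /* No reliable completion interrupt: just report the current
         * state; forward progress comes from ->wait() polling. */
        return !foo_signaled(f);
    }

    static signed long foo_wait(struct fence *f, bool intr,
                                signed long timeout)
    {
        /* Poll once per jiffy instead of sleeping on an interrupt
         * (MAX_SCHEDULE_TIMEOUT handling omitted for brevity). */
        while (!foo_signaled(f)) {
            if (intr && signal_pending(current))
                return -ERESTARTSYS;
            if (timeout-- <= 0)
                return 0; /* timed out */
            schedule_timeout_uninterruptible(1);
        }
        fence_signal(f);
        return timeout > 0 ? timeout : 1;
    }

    static const struct fence_ops foo_fence_ops = {
        .get_driver_name   = foo_get_driver_name,
        .get_timeline_name = foo_get_timeline_name,
        .enable_signaling  = foo_enable_signaling,
        .signaled          = foo_signaled,
        .wait              = foo_wait,
    };

A cross-driver waiter never sees any of this: it calls fence_wait() or
fence_add_callback() on the common struct fence and the driver-specific
behaviour stays behind the ops table.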
If I understood it right, the use case for this was waiting for any
fence out of a list of fences from multiple drivers. But if each driver
needs special handling for its wait, how can that work reliably?

Christian.

On 14.08.2014 at 11:15, Maarten Lankhorst wrote:
> On 13-08-14 at 19:07, Jerome Glisse wrote:
>> On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
>>> On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
>>>> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
>>>>> On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
>>>>>> Hi,
>>>>>>
>>>>>> So I went over the whole fence and sync point stuff, as it's
>>>>>> becoming a pressing issue. I think we first need to agree on what
>>>>>> problem we want to solve and what the requirements to solve it
>>>>>> are.
>>>>>>
>>>>>> Problem:
>>>>>> Explicit synchronization between different hardware blocks over a
>>>>>> buffer object.
>>>>>>
>>>>>> Requirements:
>>>>>> Share common infrastructure.
>>>>>> Allow optimal hardware command stream scheduling across hardware
>>>>>> blocks.
>>>>>> Allow android sync points to be implemented on top of it.
>>>>>> Handle/acknowledge exceptions (like the good old gpu lockup).
>>>>>> Minimize driver changes.
>>>>>>
>>>>>> Glossary:
>>>>>> hardware timeline: timeline bound to a specific hardware block.
>>>>>> pipeline timeline: timeline bound to a userspace rendering
>>>>>>   pipeline; each point on that timeline can be a composite of
>>>>>>   several different hardware timeline points.
>>>>>> pipeline: abstract object representing a userspace application's
>>>>>>   graphic pipeline and each of its graphic operations.
>>>>>> fence: specific point on a timeline where synchronization needs
>>>>>>   to happen.
>>>>>>
>>>>>> So now, the current include/linux/fence.h implementation is, I
>>>>>> believe, missing the objective by confusing hardware and pipeline
>>>>>> timelines and by bolting fences to buffer objects, while what is
>>>>>> really needed is a true and proper timeline for both hardware and
>>>>>> pipeline. But before going further down that road let me look at
>>>>>> things and explain how I see them.
>>>>> Fences can be used free-standing and no one forces you to
>>>>> integrate them with buffers. We actually plan to go this way with
>>>>> the intel svm stuff. Ofc for dma-buf the plan is to synchronize
>>>>> using such fences, but that's somewhat orthogonal I think. At
>>>>> least you only talk about fences and timelines and not dma-buf
>>>>> here.
>>>>>
>>>>>> The current ttm fence has one sole purpose: allow synchronization
>>>>>> for buffer object moves, even though some drivers like radeon
>>>>>> slightly abuse it and use it for things like lockup detection.
>>>>>>
>>>>>> The new fence wants to expose an api that would allow some
>>>>>> implementation of a timeline. For that it introduces callbacks
>>>>>> and some hard requirements on what the driver has to expose:
>>>>>>   enable_signaling
>>>>>>   [signaled]
>>>>>>   wait
>>>>>>
>>>>>> Each of those has to do work inside the driver to which the fence
>>>>>> belongs, and each of those can be called from more or less
>>>>>> unexpected contexts (with restrictions, like outside irq).
>>>>>> So we end up with things like:
>>>>>>
>>>>>>   Process 1             Process 2                  Process 3
>>>>>>   I_A_schedule(fence0)
>>>>>>                         CI_A_F_B_signaled(fence0)
>>>>>>                                                    I_A_signal(fence0)
>>>>>>                                                    CI_B_F_A_callback(fence0)
>>>>>>                         CI_A_F_B_wait(fence0)
>>>>>>
>>>>>> Legend:
>>>>>>   I_x: in driver x (I_A == in driver A)
>>>>>>   CI_x_F_y: call in driver x from driver y (CI_A_F_B == call in
>>>>>>     driver A from driver B)
>>>>>>
>>>>>> So this is a happy mess: everyone calls everyone, and this is
>>>>>> bound to get messy. Yes, I know there are all kinds of
>>>>>> requirements on what happens once a fence is signaled. But those
>>>>>> requirements only look like they are trying to mitigate whatever
>>>>>> mess can result from the whole callback dance.
>>>>>>
>>>>>> While I too was seduced by the whole callback idea a long time
>>>>>> ago, I think it is a highly dangerous path to take, where the
>>>>>> combinatorics of what could happen are bound to explode with the
>>>>>> increase in the number of players.
>>>>>>
>>>>>> So now back to how to solve the problem we are trying to address.
>>>>>> First I want to make an observation: almost all GPUs that exist
>>>>>> today have a command ring on which userspace command buffers are
>>>>>> executed, and inside the command ring you can do something like:
>>>>>>
>>>>>>   if (condition) execute_command_buffer else skip_command_buffer
>>>>>>
>>>>>> where condition is a simple expression (memory_address cop
>>>>>> value), with cop one of the generic comparisons (==, <, >, <=,
>>>>>> >=). I think it is a safe assumption that any gpu that remotely
>>>>>> matters can do that. Those that can not should fix their command
>>>>>> ring processor.
>>>>>>
>>>>>> With that in mind, I think the proper solution is implementing
>>>>>> timelines and having fences be timeline objects with a much
>>>>>> simpler api. For each hardware timeline, the driver provides a
>>>>>> system memory address at which the latest signaled fence sequence
>>>>>> number can be read. Each fence object is uniquely associated with
>>>>>> both a hardware and a pipeline timeline. Each pipeline timeline
>>>>>> has a wait queue.
>>>>>>
>>>>>> When scheduling something that requires synchronization on a
>>>>>> hardware timeline, a fence is created and associated with the
>>>>>> pipeline timeline and the hardware timeline. Other hardware
>>>>>> blocks that need to wait on a fence can use their command ring's
>>>>>> conditional execution to directly check the fence sequence from
>>>>>> the other hw block, so you get optimistic scheduling. If
>>>>>> optimistic scheduling fails (which would be reported by a hw
>>>>>> block specific solution and hidden), then things can fall back to
>>>>>> a software cpu wait inside what could be considered the kernel
>>>>>> thread of the pipeline timeline.
>>>>>>
>>>>>> From the api point of view there are no inter-driver calls. All a
>>>>>> driver needs to do is wake up the pipeline timeline wait_queue
>>>>>> when things are signaled or when things go sideways (gpu lockup).
>>>>>>
>>>>>> So how to implement that with current drivers? Well, easy.
>>>>>> Currently we assume implicit synchronization, so all we need is
>>>>>> an implicit pipeline timeline per userspace process (note this
>>>>>> does not prevent interprocess synchronization).
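To make the scheme above concrete, here is a tiny self-contained CPU
model of it (every name in it is made up for illustration): the
hardware block writes the last completed sequence number to a shared
memory word, and a consumer tests a fence with the same single
"memory_address >= value" comparison a command ring would perform with
conditional execution.

    /* CPU model of a hardware timeline: one memory word holds the
     * latest signaled sequence number; a fence is just (timeline,
     * seqno). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct hw_timeline {
        _Atomic uint64_t signaled_seq; /* written by the hw block only */
        uint64_t emitted_seq;          /* last seqno handed out */
    };

    struct timeline_fence {
        struct hw_timeline *hwt;
        uint64_t seq;
    };

    /* Producer side: schedule work on the timeline, get a fence back. */
    static struct timeline_fence fence_emit(struct hw_timeline *hwt)
    {
        return (struct timeline_fence){ .hwt = hwt,
                                        .seq = ++hwt->emitted_seq };
    }

    /* What the hw block does when a job completes (wraparound
     * ignored in this model). */
    static void hw_signal(struct hw_timeline *hwt, uint64_t seq)
    {
        atomic_store(&hwt->signaled_seq, seq);
    }

    /* Consumer side: the whole test is one monotonic comparison; no
     * cross-driver callback is involved.  A GPU would do the same
     * check with "if (*addr >= seq) execute else skip" in its ring. */
    static bool fence_signaled(const struct timeline_fence *f)
    {
        return atomic_load(&f->hwt->signaled_seq) >= f->seq;
    }

    int main(void)
    {
        struct hw_timeline hwt = { .signaled_seq = 0, .emitted_seq = 0 };
        struct timeline_fence f = fence_emit(&hwt);

        printf("before hw writes: %d\n", fence_signaled(&f)); /* 0 */
        hw_signal(&hwt, f.seq);
        printf("after hw writes:  %d\n", fence_signaled(&f)); /* 1 */
        return 0;
    }

The cpu fallback path described above would simply sleep on the
pipeline timeline's wait queue and re-run exactly this check on each
wakeup.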
>>>>>> Every time a command buffer is submitted it is added to the
>>>>>> implicit timeline with the simple fence object:
>>>>>>
>>>>>>   struct fence {
>>>>>>       struct list_head list_hwtimeline;
>>>>>>       struct list_head list_pipetimeline;
>>>>>>       struct hw_timeline *hw_timeline;
>>>>>>       uint64_t seq_num;
>>>>>>       work_t timedout_work;
>>>>>>       void *csdata;
>>>>>>   };
>>>>>>
>>>>>> So with a set of helper functions called by each driver's command
>>>>>> execution ioctl, you have an implicit timeline that is properly
>>>>>> populated, and each driver's command execution gets its
>>>>>> dependencies from the implicit timeline.
>>>>>>
>>>>>> Of course, to take full advantage of all the flexibility this
>>>>>> could offer, we would need to allow userspace to create pipeline
>>>>>> timelines and to schedule against the pipeline timeline of its
>>>>>> choice. We could create a file for each pipeline timeline and
>>>>>> have file operations to wait on and query progress.
>>>>>>
>>>>>> Note that gpu lockups are considered exceptional events; the
>>>>>> implicit timeline will probably want to continue with other jobs
>>>>>> on other hardware blocks, but the explicit one will probably want
>>>>>> to decide whether to continue, abort, or retry without the faulty
>>>>>> hw block.
>>>>>>
>>>>>> I realize I am late to the party and that I should have taken a
>>>>>> serious look at all this a long time ago. I apologize for that,
>>>>>> and if you consider this too late then just ignore me, modulo the
>>>>>> big warning about the craziness that callbacks will introduce and
>>>>>> how bad things are bound to happen. I am not saying that bad
>>>>>> things can not happen with what I propose, just that because
>>>>>> everything happens inside the context of the process that is
>>>>>> asking/requiring synchronization, there will be no interprocess
>>>>>> kernel callbacks (a callback that was registered by one process
>>>>>> but is called inside another process's time slice, because fence
>>>>>> signaling happens inside that other process's time slice).
>>>>> So I read through it all, and presuming I understand it correctly,
>>>>> your proposal and what we currently have are about the same. The
>>>>> big difference is that you make a timeline a first-class object
>>>>> and move the callback queue from the fence to the timeline, which
>>>>> requires callers to check the fence/seqno/whatever themselves
>>>>> instead of pushing that responsibility to callers.
>>>> No, the big difference is that there are no callbacks: when waiting
>>>> for a fence you are either inside the process context that needs to
>>>> wait for it, or inside a kernel thread's process context. Which
>>>> means in both cases you can do whatever you want. What I hate about
>>>> the fence code as it is, is the callback stuff, because you never
>>>> know in which context fences are signaled, so you never know in
>>>> which context callbacks are executed.
>>> Look at waitqueues a bit closer. They're implemented with callbacks
>>> ;-) The only difference is that you're allowed to have spurious
>>> wakeups and need to handle that somehow, so you need a separate
>>> check function.
>> No, this is not how wait queues are implemented; i.e. a wait queue
>> does not call back a random function from a random driver, it calls
>> back a limited set of functions from the core linux kernel scheduler,
>> so that the process thread that was waiting and off the scheduler
>> list is added back and marked as something that should be scheduled.
>> Unless this part of the kernel changed drastically for the worse
>> recently.
>>
>> So this is fundamentally different. Fences as they are now allow
>> random driver callbacks, and this is bound to get ugly; it is bound
>> to lead to one driver doing something that seems innocuous but turns
>> out to wreak havoc when called from some other driver's function.
> No, really, look closer.
>
> fence_default_wait adds a callback, fence_default_wait_cb, which wakes
> up the waiting thread if the fence gets signaled. The callback calls
> wake_up_state, which calls try_to_wake_up.
>
> default_wake_function, which is used by wait queues, does something
> similar: it calls try_to_wake_up.
>
> Fence now has some additional checks, but originally it was
> implemented as a wait queue.
>
> But because of driver differences I can't implement it as a straight
> wait queue. Some drivers may not have a reliable interrupt, so they
> need a custom wait function (qxl). Some may need to do extra flushing
> to get fences signaled (vmwgfx), others need some locking to protect
> against gpu lockup races (radeon, i915??). And nouveau doesn't use
> wait queues, but rolls its own.
>
> Fences also don't imply implicit sync; you can use explicit sync if
> you want. I posted a patch for this, but if you want to create an
> android userspace fence, call:
>
>   struct sync_fence *sync_fence_create_dma(const char *name,
>                                            struct fence *pt)
>
> I'll try to get the patch for this into 3.18 through the dma-buf tree;
> i915 wants to use it.
>
> ~Maarten
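For illustration only, exporting a driver fence to userspace with that
function might look like the sketch below. sync_fence_create_dma() is
the signature quoted above; the fd plumbing follows the usual staging
android sync pattern, and export_fence_to_fd() is a hypothetical
helper, not part of the patch.

    #include <linux/fcntl.h>
    #include <linux/fence.h>
    #include <linux/file.h>
    #include "sync.h" /* drivers/staging/android/sync.h */

    /* Hypothetical helper: wrap a driver's struct fence in an android
     * sync_fence and hand it to userspace as a file descriptor. */
    static int export_fence_to_fd(struct fence *f)
    {
        struct sync_fence *sf;
        int fd = get_unused_fd_flags(O_CLOEXEC);

        if (fd < 0)
            return fd;

        sf = sync_fence_create_dma("driver-fence", f);
        if (!sf) {
            put_unused_fd(fd);
            return -ENOMEM;
        }

        sync_fence_install(sf, fd); /* fd now owns the sync_fence ref */
        return fd;
    }

Userspace can then poll() the fd or pass it to another process, which
is the explicit-sync path mentioned above.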