> But because of driver differences I can't implement it as a straight
> wait queue. Some drivers may not have a reliable interrupt, so they
> need a custom wait function (qxl). Some may need to do extra flushing
> to get fences signaled (vmwgfx), others need some locking to protect
> against gpu lockup races (radeon, i915??). And nouveau doesn't use
> wait queues, but rolls its own.

But when all those drivers need a special wait function, how can you
still justify the common callback when a fence is signaled?
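For reference, the split Maarten describes would look roughly like this
against the 3.17-era include/linux/fence.h API: the callback list and
fence_add_callback() stay common to all drivers, and only the wait path
is overridden. This is a sketch, not code from the thread; every foo_*
name is hypothetical, and the polling ->wait() stands in for a driver
without a reliable completion interrupt.

    /* Sketch only: a driver fence whose ->wait() polls a hw-visible
     * seqno word instead of sleeping on an interrupt. */
    #include <linux/fence.h>
    #include <linux/io.h>
    #include <linux/sched.h>

    struct foo_fence {
        struct fence base;
        void __iomem *last_seq; /* hw-written: last signaled seqno */
    };

    static const char *foo_get_driver_name(struct fence *f)
    {
        return "foo";
    }

    static const char *foo_get_timeline_name(struct fence *f)
    {
        return "foo-ring0";
    }

    static bool foo_signaled(struct fence *f)
    {
        struct foo_fence *ff = container_of(f, struct foo_fence, base);

        /* One monotonic comparison (seqno wraparound ignored here). */
        return readl(ff->last_seq) >= f->seqno;
    }

    static bool foo_enable_signaling(struct fence *f)
    {
        /* No reliable completion interrupt: just report the current
         * state; forward progress comes from ->wait() polling. */
        return !foo_signaled(f);
    }

    static signed long foo_wait(struct fence *f, bool intr,
                                signed long timeout)
    {
        /* Poll once per jiffy instead of sleeping on an interrupt
         * (MAX_SCHEDULE_TIMEOUT handling omitted for brevity). */
        while (!foo_signaled(f)) {
            if (intr && signal_pending(current))
                return -ERESTARTSYS;
            if (timeout-- <= 0)
                return 0; /* timed out */
            schedule_timeout_uninterruptible(1);
        }
        fence_signal(f);
        return timeout > 0 ? timeout : 1;
    }

    static const struct fence_ops foo_fence_ops = {
        .get_driver_name   = foo_get_driver_name,
        .get_timeline_name = foo_get_timeline_name,
        .enable_signaling  = foo_enable_signaling,
        .signaled          = foo_signaled,
        .wait              = foo_wait,
    };

A cross-driver waiter never sees any of this: it calls fence_wait() or
fence_add_callback() on the common struct fence and the driver-specific
behaviour stays behind the ops table.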
If I understood it right, the use case for this was waiting for any
fence out of a list of fences from multiple drivers. But if each driver
needs special handling for its wait, how can that work reliably?

Christian.

On 14.08.2014 at 11:15, Maarten Lankhorst wrote:
> On 13-08-14 at 19:07, Jerome Glisse wrote:
>> On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
>>> On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
>>>> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
>>>>> On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
>>>>>> Hi,
>>>>>>
>>>>>> So I went over the whole fence and sync point stuff, as it's
>>>>>> becoming a pressing issue. I think we first need to agree on what
>>>>>> problem we want to solve and what the requirements to solve it
>>>>>> are.
>>>>>>
>>>>>> Problem:
>>>>>> Explicit synchronization between different hardware blocks over a
>>>>>> buffer object.
>>>>>>
>>>>>> Requirements:
>>>>>> Share common infrastructure.
>>>>>> Allow optimal hardware command stream scheduling across hardware
>>>>>> blocks.
>>>>>> Allow android sync points to be implemented on top of it.
>>>>>> Handle/acknowledge exceptions (like the good old gpu lockup).
>>>>>> Minimize driver changes.
>>>>>>
>>>>>> Glossary:
>>>>>> hardware timeline: timeline bound to a specific hardware block.
>>>>>> pipeline timeline: timeline bound to a userspace rendering
>>>>>>   pipeline; each point on that timeline can be a composite of
>>>>>>   several different hardware timeline points.
>>>>>> pipeline: abstract object representing a userspace application's
>>>>>>   graphic pipeline and each of its graphic operations.
>>>>>> fence: specific point on a timeline where synchronization needs
>>>>>>   to happen.
>>>>>>
>>>>>> So now, the current include/linux/fence.h implementation is, I
>>>>>> believe, missing the objective by confusing hardware and pipeline
>>>>>> timelines and by bolting fences to buffer objects, while what is
>>>>>> really needed is a true and proper timeline for both hardware and
>>>>>> pipeline. But before going further down that road let me look at
>>>>>> things and explain how I see them.
>>>>> Fences can be used free-standing and no one forces you to
>>>>> integrate them with buffers. We actually plan to go this way with
>>>>> the intel svm stuff. Ofc for dma-buf the plan is to synchronize
>>>>> using such fences, but that's somewhat orthogonal I think. At
>>>>> least you only talk about fences and timelines and not dma-buf
>>>>> here.
>>>>>
>>>>>> The current ttm fence has one sole purpose: allow synchronization
>>>>>> for buffer object moves, even though some drivers like radeon
>>>>>> slightly abuse it and use it for things like lockup detection.
>>>>>>
>>>>>> The new fence wants to expose an api that would allow some
>>>>>> implementation of a timeline. For that it introduces callbacks
>>>>>> and some hard requirements on what the driver has to expose:
>>>>>>   enable_signaling
>>>>>>   [signaled]
>>>>>>   wait
>>>>>>
>>>>>> Each of those has to do work inside the driver to which the fence
>>>>>> belongs, and each of those can be called from more or less
>>>>>> unexpected contexts (with restrictions, like outside irq).
>>>>>> So we end up with things like:
>>>>>>
>>>>>>   Process 1             Process 2                  Process 3
>>>>>>   I_A_schedule(fence0)
>>>>>>                         CI_A_F_B_signaled(fence0)
>>>>>>                                                    I_A_signal(fence0)
>>>>>>                                                    CI_B_F_A_callback(fence0)
>>>>>>                         CI_A_F_B_wait(fence0)
>>>>>>
>>>>>> Legend:
>>>>>>   I_x: in driver x (I_A == in driver A)
>>>>>>   CI_x_F_y: call in driver x from driver y (CI_A_F_B == call in
>>>>>>     driver A from driver B)
>>>>>>
>>>>>> So this is a happy mess: everyone calls everyone, and this is
>>>>>> bound to get messy. Yes, I know there are all kinds of
>>>>>> requirements on what happens once a fence is signaled. But those
>>>>>> requirements only look like they are trying to mitigate whatever
>>>>>> mess can result from the whole callback dance.
>>>>>>
>>>>>> While I too was seduced by the whole callback idea a long time
>>>>>> ago, I think it is a highly dangerous path to take, where the
>>>>>> combinatorics of what could happen are bound to explode with the
>>>>>> increase in the number of players.
>>>>>>
>>>>>> So now back to how to solve the problem we are trying to address.
>>>>>> First I want to make an observation: almost all GPUs that exist
>>>>>> today have a command ring on which userspace command buffers are
>>>>>> executed, and inside the command ring you can do something like:
>>>>>>
>>>>>>   if (condition) execute_command_buffer else skip_command_buffer
>>>>>>
>>>>>> where condition is a simple expression (memory_address cop
>>>>>> value), with cop one of the generic comparisons (==, <, >, <=,
>>>>>> >=). I think it is a safe assumption that any gpu that remotely
>>>>>> matters can do that. Those that can not should fix their command
>>>>>> ring processor.
>>>>>>
>>>>>> With that in mind, I think the proper solution is implementing
>>>>>> timelines and having fences be timeline objects with a much
>>>>>> simpler api. For each hardware timeline, the driver provides a
>>>>>> system memory address at which the latest signaled fence sequence
>>>>>> number can be read. Each fence object is uniquely associated with
>>>>>> both a hardware and a pipeline timeline. Each pipeline timeline
>>>>>> has a wait queue.
>>>>>>
>>>>>> When scheduling something that requires synchronization on a
>>>>>> hardware timeline, a fence is created and associated with the
>>>>>> pipeline timeline and the hardware timeline. Other hardware
>>>>>> blocks that need to wait on a fence can use their command ring's
>>>>>> conditional execution to directly check the fence sequence from
>>>>>> the other hw block, so you get optimistic scheduling. If
>>>>>> optimistic scheduling fails (which would be reported by a hw
>>>>>> block specific solution and hidden), then things can fall back to
>>>>>> a software cpu wait inside what could be considered the kernel
>>>>>> thread of the pipeline timeline.
>>>>>>
>>>>>> From the api point of view there are no inter-driver calls. All a
>>>>>> driver needs to do is wake up the pipeline timeline wait_queue
>>>>>> when things are signaled or when things go sideways (gpu lockup).
>>>>>>
>>>>>> So how to implement that with current drivers? Well, easy.
>>>>>> Currently we assume implicit synchronization, so all we need is
>>>>>> an implicit pipeline timeline per userspace process (note this
>>>>>> does not prevent interprocess synchronization).
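To make the scheme above concrete, here is a tiny self-contained CPU
model of it (every name in it is made up for illustration): the
hardware block writes the last completed sequence number to a shared
memory word, and a consumer tests a fence with the same single
"memory_address >= value" comparison a command ring would perform with
conditional execution.

    /* CPU model of a hardware timeline: one memory word holds the
     * latest signaled sequence number; a fence is just (timeline,
     * seqno). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct hw_timeline {
        _Atomic uint64_t signaled_seq; /* written by the hw block only */
        uint64_t emitted_seq;          /* last seqno handed out */
    };

    struct timeline_fence {
        struct hw_timeline *hwt;
        uint64_t seq;
    };

    /* Producer side: schedule work on the timeline, get a fence back. */
    static struct timeline_fence fence_emit(struct hw_timeline *hwt)
    {
        return (struct timeline_fence){ .hwt = hwt,
                                        .seq = ++hwt->emitted_seq };
    }

    /* What the hw block does when a job completes (wraparound
     * ignored in this model). */
    static void hw_signal(struct hw_timeline *hwt, uint64_t seq)
    {
        atomic_store(&hwt->signaled_seq, seq);
    }

    /* Consumer side: the whole test is one monotonic comparison; no
     * cross-driver callback is involved.  A GPU would do the same
     * check with "if (*addr >= seq) execute else skip" in its ring. */
    static bool fence_signaled(const struct timeline_fence *f)
    {
        return atomic_load(&f->hwt->signaled_seq) >= f->seq;
    }

    int main(void)
    {
        struct hw_timeline hwt = { .signaled_seq = 0, .emitted_seq = 0 };
        struct timeline_fence f = fence_emit(&hwt);

        printf("before hw writes: %d\n", fence_signaled(&f)); /* 0 */
        hw_signal(&hwt, f.seq);
        printf("after hw writes:  %d\n", fence_signaled(&f)); /* 1 */
        return 0;
    }

The cpu fallback path described above would simply sleep on the
pipeline timeline's wait queue and re-run exactly this check on each
wakeup.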
>>>>>> Every time a command buffer is submitted it is added to the
>>>>>> implicit timeline with the simple fence object:
>>>>>>
>>>>>>   struct fence {
>>>>>>       struct list_head list_hwtimeline;
>>>>>>       struct list_head list_pipetimeline;
>>>>>>       struct hw_timeline *hw_timeline;
>>>>>>       uint64_t seq_num;
>>>>>>       work_t timedout_work;
>>>>>>       void *csdata;
>>>>>>   };
>>>>>>
>>>>>> So with a set of helper functions called by each driver's command
>>>>>> execution ioctl, you have an implicit timeline that is properly
>>>>>> populated, and each driver's command execution gets its
>>>>>> dependencies from the implicit timeline.
>>>>>>
>>>>>> Of course, to take full advantage of all the flexibility this
>>>>>> could offer, we would need to allow userspace to create pipeline
>>>>>> timelines and to schedule against the pipeline timeline of its
>>>>>> choice. We could create a file for each pipeline timeline and
>>>>>> have file operations to wait on and query progress.
>>>>>>
>>>>>> Note that gpu lockups are considered exceptional events; the
>>>>>> implicit timeline will probably want to continue with other jobs
>>>>>> on other hardware blocks, but the explicit one will probably want
>>>>>> to decide whether to continue, abort, or retry without the faulty
>>>>>> hw block.
>>>>>>
>>>>>> I realize I am late to the party and that I should have taken a
>>>>>> serious look at all this a long time ago. I apologize for that,
>>>>>> and if you consider this too late then just ignore me, modulo the
>>>>>> big warning about the craziness that callbacks will introduce and
>>>>>> how bad things are bound to happen. I am not saying that bad
>>>>>> things can not happen with what I propose, just that because
>>>>>> everything happens inside the context of the process that is
>>>>>> asking/requiring synchronization, there will be no interprocess
>>>>>> kernel callbacks (a callback that was registered by one process
>>>>>> but is called inside another process's time slice, because fence
>>>>>> signaling happens inside that other process's time slice).
>>>>> So I read through it all, and presuming I understand it correctly,
>>>>> your proposal and what we currently have are about the same. The
>>>>> big difference is that you make a timeline a first-class object
>>>>> and move the callback queue from the fence to the timeline, which
>>>>> requires callers to check the fence/seqno/whatever themselves
>>>>> instead of pushing that responsibility to callers.
>>>> No, the big difference is that there are no callbacks: when waiting
>>>> for a fence you are either inside the process context that needs to
>>>> wait for it, or inside a kernel thread's process context. Which
>>>> means in both cases you can do whatever you want. What I hate about
>>>> the fence code as it is, is the callback stuff, because you never
>>>> know in which context fences are signaled, so you never know in
>>>> which context callbacks are executed.
>>> Look at waitqueues a bit closer. They're implemented with callbacks
>>> ;-) The only difference is that you're allowed to have spurious
>>> wakeups and need to handle that somehow, so you need a separate
>>> check function.
>> No, this is not how wait queues are implemented; i.e. a wait queue
>> does not call back a random function from a random driver, it calls
>> back a limited set of functions from the core linux kernel scheduler,
>> so that the process thread that was waiting and off the scheduler
>> list is added back and marked as something that should be scheduled.
>> Unless this part of the kernel changed drastically for the worse
>> recently.
>>
>> So this is fundamentally different. Fences as they are now allow
>> random driver callbacks, and this is bound to get ugly; it is bound
>> to lead to one driver doing something that seems innocuous but turns
>> out to wreak havoc when called from some other driver's function.
> No, really, look closer.
>
> fence_default_wait adds a callback, fence_default_wait_cb, which wakes
> up the waiting thread if the fence gets signaled. The callback calls
> wake_up_state, which calls try_to_wake_up.
>
> default_wake_function, which is used by wait queues, does something
> similar: it calls try_to_wake_up.
>
> Fence now has some additional checks, but originally it was
> implemented as a wait queue.
>
> But because of driver differences I can't implement it as a straight
> wait queue. Some drivers may not have a reliable interrupt, so they
> need a custom wait function (qxl). Some may need to do extra flushing
> to get fences signaled (vmwgfx), others need some locking to protect
> against gpu lockup races (radeon, i915??). And nouveau doesn't use
> wait queues, but rolls its own.
>
> Fences also don't imply implicit sync; you can use explicit sync if
> you want. I posted a patch for this, but if you want to create an
> android userspace fence, call:
>
>   struct sync_fence *sync_fence_create_dma(const char *name,
>                                            struct fence *pt)
>
> I'll try to get the patch for this into 3.18 through the dma-buf tree;
> i915 wants to use it.
>
> ~Maarten
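For illustration only, exporting a driver fence to userspace with that
function might look like the sketch below. sync_fence_create_dma() is
the signature quoted above; the fd plumbing follows the usual staging
android sync pattern, and export_fence_to_fd() is a hypothetical
helper, not part of the patch.

    #include <linux/fcntl.h>
    #include <linux/fence.h>
    #include <linux/file.h>
    #include "sync.h" /* drivers/staging/android/sync.h */

    /* Hypothetical helper: wrap a driver's struct fence in an android
     * sync_fence and hand it to userspace as a file descriptor. */
    static int export_fence_to_fd(struct fence *f)
    {
        struct sync_fence *sf;
        int fd = get_unused_fd_flags(O_CLOEXEC);

        if (fd < 0)
            return fd;

        sf = sync_fence_create_dma("driver-fence", f);
        if (!sf) {
            put_unused_fd(fd);
            return -ENOMEM;
        }

        sync_fence_install(sf, fd); /* fd now owns the sync_fence ref */
        return fd;
    }

Userspace can then poll() the fd or pass it to another process, which
is the explicit-sync path mentioned above.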