Swapbuffers [was: Re: DRI2 and lock-less operation]

2007-11-28 Thread Keith Whitwell
Kristian Høgsberg wrote:
 On Nov 27, 2007 11:48 AM, Stephane Marchesin
 [EMAIL PROTECTED] wrote:
 On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote:
 ...
 It's all delightfully simple, but I'm starting to reconsider whether
 the lockless bullet point is realistic.  Note, the drawable lock is
 gone, we always render to private back buffers and do swap buffers in
 the kernel, so I'm only concerned with the DRI lock here.  The idea
 is that since we have the memory manager and the super-ioctl, and the
 X server can now push cliprects into the kernel in one atomic
 operation, we would be able to get rid of the DRI lock.  My overall
 question here is: is that feasible?
 How do you plan to ensure that X didn't change the cliprects after you
 emitted them to the DRM?
 
 The idea was that the buffer swap happens in the kernel, triggered by
 an ioctl. The kernel generates the command stream to execute the swap
 against the current set of cliprects.  The back buffers are always
 private so the cliprects only come into play when copying from the
 back buffer to the front buffer.  Single buffered visuals are secretly
 double buffered and implemented the same way.
 
 I'm now trying to figure out whether it makes more sense to keep
 cliprects and swapbuffer out of the kernel, which wouldn't change the
 above much, except for the swapbuffer case.  I described the idea for
 swapbuffer in this case in my reply to Thomas: the X server publishes
 cliprects to the clients through a shared ring buffer, and clients
 parse the cliprect changes out of this buffer as they need them.  When
 posting a swap buffer request, the buffer head should be included in
 the super-ioctl so that the kernel can reject stale requests.  When
 that happens, the client must parse the new cliprect info and resubmit
 an updated swap buffer request.
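
 Roughly, in C (every name below is invented for illustration; none of
 this is a real DRM interface):

 #include <errno.h>
 #include <stdint.h>
 #include <sys/ioctl.h>

 #define MAX_CLIP_RECTS 64
 #define DRM_IOCTL_SUPER_SWAP 0xc0106400     /* hypothetical ioctl */

 struct clip_rect { int16_t x1, y1, x2, y2; };

 /* Shared between the X server (writer) and the clients (readers). */
 struct clip_ring {
     uint32_t head;                    /* bumped on every cliprect change */
     uint32_t num_rects;
     struct clip_rect rects[MAX_CLIP_RECTS];
 };

 struct swap_request {
     uint32_t drawable_id;
     uint32_t clip_head;               /* generation the commands were built for */
     /* ... encoded copy commands would follow ... */
 };

 /* Encode back-to-front blits under the current cliprects (not shown). */
 void build_swap_commands(struct swap_request *req, const struct clip_ring *ring);

 /* Client side: retry until our cliprect snapshot matches the kernel's. */
 int submit_swap(int drm_fd, struct clip_ring *ring, struct swap_request *req)
 {
     do {
         req->clip_head = ring->head;  /* snapshot the generation we used */
         build_swap_commands(req, ring);
         if (ioctl(drm_fd, DRM_IOCTL_SUPER_SWAP, req) == 0)
             return 0;
     } while (errno == EAGAIN);        /* kernel rejected a stale clip_head */
     return -1;
 }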

In my ideal world, the entity which knows and cares about cliprects 
should be the one that does the swapbuffers, or at least is in control 
of the process.  That entity is the X server.

Instead of tying ourselves into knots trying to figure out how to get 
some other entity a sufficiently up-to-date set of cliprects to make 
this work (which is what was wrong with DRI 1.0), maybe we should try 
and figure out how to get the X server to efficiently orchestrate 
swapbuffers.

In particular it seems like we have:

1) The X server knows about cliprects.
2) The kernel knows about IRQ reception.
3) The kernel knows how to submit rendering commands to hardware.
4) Userspace is where we want to craft rendering commands.

Given the above, what do we think about swapbuffers:

a) Swapbuffers is a rendering command
b) which depends on cliprect information
c) that needs to be fired as soon as possible after an IRQ receipt.

So:
swapbuffers should be crafted from userspace (a, 4)
... by the X server (b, 1)
... and should be actually fired by the kernel (c, 2, 3)


I propose something like:

0) 3D client submits rendering to the kernel and receives back a fence.

1) 3D client wants to do swapbuffers.  It sends a message to the X
server asking it "please do me a swapbuffers after this fence has
completed".

2) X server crafts (somehow) commands implementing swapbuffers for this 
drawable under the current set of cliprects and passes it to the kernel 
along with the fence.

3) The kernel keeps that batchbuffer to the side until
a) the commands associated with the fence have been submitted to 
hardware.
b) the next vblank IRQ arrives.

when both of these are true, the kernel simply submits the prepared 
swapbuffer commands through the lowest latency path to hardware.
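
In rough kernel-side C, step 3 could look like this (the names are
hypothetical, not actual drm code):

#include <stdbool.h>
#include <stdint.h>

struct cmd_buffer;                             /* opaque: the prepared blits */
void hw_submit_low_latency(struct cmd_buffer *cmds);   /* hypothetical */

struct pending_swap {
    uint32_t fence;                /* client rendering this swap depends on */
    struct cmd_buffer *swap_cmds;  /* swapbuffer commands crafted by the X server */
    bool fence_submitted;          /* 3a) the fenced commands reached hardware */
    bool vblank_seen;              /* 3b) a vblank IRQ arrived */
};

static void try_fire(struct pending_swap *ps)
{
    if (ps->fence_submitted && ps->vblank_seen)
        hw_submit_low_latency(ps->swap_cmds);  /* lowest-latency path */
}

/* Called when the commands associated with the fence hit hardware. */
void on_fence_submitted(struct pending_swap *ps)
{
    ps->fence_submitted = true;
    try_fire(ps);
}

/* Called from the vblank IRQ handler. */
void on_vblank_irq(struct pending_swap *ps)
{
    ps->vblank_seen = true;
    try_fire(ps);
}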

But what happens if the cliprects change?  The 100% perfect solution 
looks like:

The X server knows all about cliprect changes, and can use fences or 
other mechanisms to keep track of which swapbuffers are outstanding.  At 
the time of a cliprect change, it must create new swapbuffer commandsets 
for all pending swapbuffers and re-submit those to the kernel.

These new sets of commands must be tied to the progress of the X 
server's own rendering command stream so that the kernel fires the 
appropriate one to land the swapbuffers to the correct destination as 
the X server's own rendering flies by.

In the simplest case, where the kernel puts commands onto the one true 
ring as it receives them, the kernel can simply discard the old 
swapbuffer command.  Indeed this is true also if the kernel has a 
ring-per-context and uses one of those rings to serialize the X server 
rendering and swapbuffers commands.
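
Reusing the hypothetical pending_swap sketch above, the resubmit is then
just a replacement of the prepared commands; the swap itself is never
dropped:

struct cmd_buffer;
void cmd_buffer_free(struct cmd_buffer *cmds);   /* hypothetical */

/* The X server resubmitted this swap after a cliprect change: install
 * the re-crafted blits and discard only the stale encoding. */
void update_pending_swap(struct pending_swap *ps, struct cmd_buffer *new_cmds)
{
    struct cmd_buffer *old = ps->swap_cmds;
    ps->swap_cmds = new_cmds;     /* will fire under the current cliprects */
    cmd_buffer_free(old);
}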

Note that condition 3a) above is always true in the current i915.o 
one-true-ring/single-fifo approach to hardware serialization.

I think the above can work and seems more straightforward than many of
the proposed alternatives.

Keith




Re: Swapbuffers [was: Re: DRI2 and lock-less operation]

2007-11-28 Thread Stephane Marchesin
On 11/28/07, Keith Whitwell [EMAIL PROTECTED] wrote:


 In my ideal world, the entity which knows and cares about cliprects
 should be the one that does the swapbuffers, or at least is in control
 of the process.  That entity is the X server.

 Instead of tying ourselves into knots trying to figure out how to get
 some other entity a sufficiently up-to-date set of cliprects to make
 this work (which is what was wrong with DRI 1.0), maybe we should try
 and figure out how to get the X server to efficiently orchestrate
 swapbuffers.

 In particular it seems like we have:

 1) The X server knows about cliprects.
 2) The kernel knows about IRQ reception.
 3) The kernel knows how to submit rendering commands to hardware.
 4) Userspace is where we want to craft rendering commands.

 Given the above, what do we think about swapbuffers:

 a) Swapbuffers is a rendering command
 b) which depends on cliprect information
 c) that needs to be fired as soon as possible after an IRQ
 receipt.

 So:
 swapbuffers should be crafted from userspace (a, 4)
 ... by the X server (b, 1)
 ... and should be actually fired by the kernel (c, 2, 3)


Well, on nvidia hw, you don't even need to fire from the kernel (thanks to a
special fifo command that waits for vsync).
So I'd love it if going through the kernel for swapbuffers were abstracted by
the interface.
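
Conceptually something like this (the token and helper names are
invented; the real nvidia fifo methods are different):

#include <stdint.h>

#define CMD_WAIT_VSYNC 0x00000042u   /* invented token: hw stalls until vblank */

struct clip_rect { int16_t x1, y1, x2, y2; };
void emit_clip_blit(uint32_t *fifo, unsigned *idx, const struct clip_rect *r);

/* Encode the whole swap inline: the fifo itself waits for vsync, so no
 * kernel round trip is needed to fire it. */
void emit_swap_inline(uint32_t *fifo, unsigned *idx,
                      const struct clip_rect *rects, unsigned n)
{
    fifo[(*idx)++] = CMD_WAIT_VSYNC;
    for (unsigned i = 0; i < n; i++)
        emit_clip_blit(fifo, idx, &rects[i]);  /* back-to-front copy per rect */
}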

 I propose something like:

 0) 3D client submits rendering to the kernel and receives back a fence.

 1) 3D client wants to do swapbuffers.  It sends a message to the X
 server asking it "please do me a swapbuffers after this fence has
 completed".

 2) X server crafts (somehow) commands implementing swapbuffers for this
 drawable under the current set of cliprects and passes it to the kernel
 along with the fence.

 3) The kernel keeps that batchbuffer to the side until
 a) the commands associated with the fence have been submitted to
 hardware.
 b) the next vblank IRQ arrives.

 when both of these are true, the kernel simply submits the prepared
 swapbuffer commands through the lowest latency path to hardware.

 But what happens if the cliprects change?  The 100% perfect solution
 looks like:

 The X server knows all about cliprect changes, and can use fences or
 other mechanisms to keep track of which swapbuffers are outstanding.  At
 the time of a cliprect change, it must create new swapbuffer commandsets
 for all pending swapbuffers and re-submit those to the kernel.

 These new sets of commands must be tied to the progress of the X
 server's own rendering command stream so that the kernel fires the
 appropriate one to land the swapbuffers to the correct destination as
 the X server's own rendering flies by.


Yes, that was the basis for my thinking as well. By inserting the swapbuffers
into the normal flow of X commands, we remove the need for syncing with the
X server at swapbuffer time.


 In the simplest case, where the kernel puts commands onto the one true
 ring as it receives them, the kernel can simply discard the old
 swapbuffer command.  Indeed this is true also if the kernel has a
 ring-per-context and uses one of those rings to serialize the X server
 rendering and swapbuffers commands.


Come on, admit that's a hack to get 100'000 fps in glxgears :)


 Note that condition 3a) above is always true in the current i915.o
 one-true-ring/single-fifo approach to hardware serialization.


Yes, I think those details of how to wait should be left driver-dependent
and abstracted in user space, so that we have the choice of calling the
kernel, doing it from user space only, relying on a single fifo, or whatever.


 I think the above can work and seems more straightforward than many of
 the proposed alternatives.


This is what I want to do too. Especially since in the nvidia case we don't
have the issue of routing vblank interrupts to user space for that.

So, the only issue I'm worried about is the latency induced by this
approach. When the DRM does the swaps, you can ensure they'll get executed
pretty fast. If X has been stuffing piles of commands into its command
buffer, it might not be so fast. What this means is that 3D might be slowed
down by 2D rendering (think especially of the case of EXA fallbacks, which
will sync your fifo). In that case, ensuring a no-fallback EXA would become
relevant to achieving smooth 3D performance. But at least it solves the
issue of sluggish OpenGL window moves and resizes (/me looks at the nvidia
binary driver behaviour).

Stephane

Re: Swapbuffers [was: Re: DRI2 and lock-less operation]

2007-11-28 Thread Keith Whitwell
Stephane Marchesin wrote:
 On 11/28/07, Keith Whitwell [EMAIL PROTECTED] wrote:
 
 
 In my ideal world, the entity which knows and cares about cliprects
 should be the one that does the swapbuffers, or at least is in control
 of the process.  That entity is the X server.
 
 Instead of tying ourselves into knots trying to figure out how to get
 some other entity a sufficiently up-to-date set of cliprects to make
 this work (which is what was wrong with DRI 1.0), maybe we should try
 and figure out how to get the X server to efficiently orchestrate
 swapbuffers.
 
 In particular it seems like we have:
 
 1) The X server knows about cliprects.
 2) The kernel knows about IRQ reception.
 3) The kernel knows how to submit rendering commands to hardware.
 4) Userspace is where we want to craft rendering commands.
 
 Given the above, what do we think about swapbuffers:
 
 a) Swapbuffers is a rendering command
 b) which depends on cliprect information
 c) that needs to be fired as soon as possible after an IRQ
 receipt.
 
 So:
 swapbuffers should be crafted from userspace (a, 4)
 ... by the X server (b, 1)
 ... and should be actually fired by the kernel (c, 2, 3)
 
 
 Well, on nvidia hw, you don't even need to fire from the kernel (thanks 
 to a special fifo command that waits for vsync).
 So I'd love it if going through the kernel for swapbuffers were 
 abstracted by the interface.

As I suggested elsewhere, I think that you're probably going to need 
this even on nvidia hardware.

 I propose something like:
 
 0) 3D client submits rendering to the kernel and receives back a fence.
 
 1) 3D client wants to do swapbuffers.  It sends a message to the X
 server asking it "please do me a swapbuffers after this fence has
 completed".
 
 2) X server crafts (somehow) commands implementing swapbuffers for this
 drawable under the current set of cliprects and passes it to the kernel
 along with the fence.
 
 3) The kernel keeps that batchbuffer to the side until
 a) the commands associated with the fence have been
 submitted to hardware.
 b) the next vblank IRQ arrives.
 
 when both of these are true, the kernel simply submits the prepared
 swapbuffer commands through the lowest latency path to hardware.
 
 But what happens if the cliprects change?  The 100% perfect solution
 looks like:
 
 The X server knows all about cliprect changes, and can use fences or
 other mechanisms to keep track of which swapbuffers are
 outstanding.  At
 the time of a cliprect change, it must create new swapbuffer commandsets
 for all pending swapbuffers and re-submit those to the kernel.
 
 These new sets of commands must be tied to the progress of the X
 server's own rendering command stream so that the kernel fires the
 appropriate one to land the swapbuffers to the correct destination as
 the X server's own rendering flies by.
 
 
 Yes, that was the basis for my thinking as well. By inserting the 
 swapbuffers into the normal flow of X commands, we remove the need for 
 syncing with the X server at swapbuffer time.

The very simplest way would be just to have the X server queue it up 
like normal blits and not even involve the kernel.  I'm not proposing 
this.  I believe such an approach will fail for the sync-to-vblank case 
due to latency issues - even (I suspect) with hardware-wait-for-vblank.

Rather, I'm describing a mechanism that allows a pre-prepared swapbuffer 
command to be injected into the X command stream (one way or another) 
with the guarantee that it will encode the correct cliprects, but which 
will avoid stalling the command queue in the meantime.


 In the simplest case, where the kernel puts commands onto the one true
 ring as it receives them, the kernel can simply discard the old
 swapbuffer command.  Indeed this is true also if the kernel has a
 ring-per-context and uses one of those rings to serialize the X server
 rendering and swapbuffers commands. 
 
 
 Come on, admit that's a hack to get 100'000 fps in glxgears :)

I'm not talking about discarding the whole swap operation, just the
version of the swap command buffer that pertained to the old cliprects.
Every swap is still being performed.

You do raise a good point though -- we currently throttle the 3d driver 
based on swapbuffer fences.  There would need to be some equivalent 
mechanism to achieve this.
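
One possible equivalent, sketched with an invented wait primitive: the
client keeps the fence returned by each swap request and blocks before
getting more than a couple of frames ahead:

#include <stdint.h>

#define MAX_FRAMES_AHEAD 2

uint32_t request_swap(int drm_fd);               /* returns this swap's fence */
void     fence_wait(int drm_fd, uint32_t fence); /* invented wait primitive */

static uint32_t swap_fences[MAX_FRAMES_AHEAD];
static unsigned frame;

/* Block until the oldest outstanding swap completes, then queue the next. */
void throttled_swap(int drm_fd)
{
    unsigned slot = frame % MAX_FRAMES_AHEAD;
    if (frame >= MAX_FRAMES_AHEAD)
        fence_wait(drm_fd, swap_fences[slot]);
    swap_fences[slot] = request_swap(drm_fd);
    frame++;
}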

 
 Note that condition 3a) above is always true in the current i915.o
 one-true-ring/single-fifo approach to hardware serialization.
 
 
 Yes, I think those details of how to wait should be left driver-dependent
 and abstracted in user space, so that we have the choice of calling the
 kernel, doing it from user space only, relying on a single fifo, or
 whatever.

Re: Swapbuffers [was: Re: DRI2 and lock-less operation]

2007-11-28 Thread Jerome Glisse
On Wed, Nov 28, 2007 at 12:43:18PM +0100, Stephane Marchesin wrote:
 
 This is what I want to do too. Especially since in the nvidia case we don't
 have the issue of routing vblank interrupts to user space for that.
 
 So, the only issue I'm worried about is the latency induced by this
 approach. When the DRM does the swaps, you can ensure they'll get executed
 pretty fast. If X has been stuffing piles of commands into its command
 buffer, it might not be so fast. What this means is that 3D might be slowed
 down by 2D rendering (think especially of the case of EXA fallbacks, which
 will sync your fifo). In that case, ensuring a no-fallback EXA would become
 relevant to achieving smooth 3D performance. But at least it solves the
 issue of sluggish OpenGL window moves and resizes (/me looks at the nvidia
 binary driver behaviour).
 
 Stephane

I likely had a problem with my mail, as I think my previous mail didn't
get through. Anyway, I am all for having the X server responsible for
swapping buffers. To solve part of the above problem we might have two
contexts (fifos) for the X server: one for X drawing, one for X swapping
buffers.

The swap context (fifo) is a top-priority thing and should preempt the
other contexts (fifos). An outcome of this is that we might want a simple
gpu scheduler for such a case (and maybe others in the future), but
obviously such a scheduler will be highly hw dependent.
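
As a sketch (hypothetical names, and as said the real thing is hw
dependent), the scheduler could be as simple as always draining the
swap fifo first:

#include <stddef.h>

#define FIFO_DEPTH 64

struct cmd_buffer;

struct gpu_fifo {
    struct cmd_buffer *queue[FIFO_DEPTH];
    unsigned head, tail;
};

static struct cmd_buffer *fifo_pop(struct gpu_fifo *f)
{
    return f->queue[f->head++ % FIFO_DEPTH];
}

/* Swap context preempts the drawing context: always drain it first. */
struct cmd_buffer *pick_next(struct gpu_fifo *swap_fifo,
                             struct gpu_fifo *draw_fifo)
{
    if (swap_fifo->head != swap_fifo->tail)
        return fifo_pop(swap_fifo);
    if (draw_fifo->head != draw_fifo->tail)
        return fifo_pop(draw_fifo);
    return NULL;   /* both fifos empty */
}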

Cheers,
Jerome Glisse
