Re: DRI2 and lock-less operation
Kristian Høgsberg wrote:

On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote:

... In both cases, the kernel will need to know how to activate a given context, and the context handle should be part of the super-ioctl arguments.

We needn't expose the contexts to user space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server, where a context per client may be useful).

Oh right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd?

I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, then expect user space to resubmit with a state-emission preamble. In fact it may work well for single-context hardware...

I recall having the same discussion in the past; having the super-ioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware. All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'.

Exactly. But the super-ioctl is chipset-specific and we can decide on the details there on a chipset-by-chipset basis. If you have input on how the super-ioctl for Intel should look to support lockless operation for current and future Intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort.

We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted.

Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support.
The drivers already have the code to emit state in case of context loss; that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work?

Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that.

Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported?

There are three ways to support lockless operation:

- hardware contexts
- a full preamble emit per batchbuffer
- passing a pair of preamble, payload batchbuffers per ioctl

I think all hardware is capable of supporting at least one of these. That said, if the super-ioctl is per-device, then you can make a choice per-device in terms of whether the lock is required or not, which makes things easier. The reality is that most ttm-based drivers will do very little differently on a regular lock compared to a contended one -- at most they could decide whether or not to emit a preamble they computed earlier.

Keith

-
SF.Net email is sponsored by: The Future of Linux Business White Paper from Novell. From the desktop to the data center, Linux is going mainstream. Let it simplify your IT future. http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel
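The flag-plus-error-return scheme discussed above can be simulated in user space. This is a minimal sketch under invented names -- `ELOSTCTX`, `SUBMIT_REINIT`, `drm_submit`, and `mock_hw` are all hypothetical, and the real super-ioctl would be chipset-specific as noted in the thread. The client submits optimistically without a preamble, and only on a 'lost context' error prepends the full state-emission preamble and resubmits the same rendering:

```c
#include <assert.h>

/* Hypothetical names for the scheme discussed above. */
#define ELOSTCTX      1    /* "lost context" error return */
#define SUBMIT_REINIT 0x1  /* "this submission reinitializes the hardware" */

struct mock_hw { int ctx_lost; int batches_run; int last_flags; };

/* Stand-in for the super-ioctl: fails if the context was lost and the
 * submission carries no state-restore preamble. */
static int drm_submit(struct mock_hw *hw, const char *batch, int flags)
{
    if (hw->ctx_lost && !(flags & SUBMIT_REINIT))
        return -ELOSTCTX;
    hw->ctx_lost = 0;
    hw->batches_run++;
    hw->last_flags = flags;
    return 0;
}

/* Client loop: optimistically submit without a preamble; on -ELOSTCTX,
 * resubmit the unchanged rendering flagged with the state preamble. */
static int submit_with_resubmit(struct mock_hw *hw, const char *batch)
{
    int ret = drm_submit(hw, batch, 0);
    if (ret == -ELOSTCTX)
        ret = drm_submit(hw, batch, SUBMIT_REINIT);
    return ret;
}
```

The point of the sketch is the asymmetry Keith describes: the full state restore is only computed on the rare lost-context path, not on every submission.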
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote:

Another problem is that it's not just swapbuffer - anything that touches the front buffer has to respect the cliprects - glCopyPixels and glXCopySubBufferMESA - and thus has to happen in the kernel.

These don't touch the real swapbuffer, just the fake one. Hence they don't care about cliprects and certainly don't have to happen in the kernel...

Keith
Re: DRI2 and lock-less operation
On Wed, 2007-11-28 at 09:30 +, Keith Whitwell wrote:

Kristian Høgsberg wrote:

Another problem is that it's not just swapbuffer - anything that touches the front buffer has to respect the cliprects - glCopyPixels and glXCopySubBufferMESA - and thus has to happen in the kernel.

These don't touch the real swapbuffer, just the fake one. Hence they don't care about cliprects and certainly don't have to happen in the kernel...

I'm not sure about glCopyPixels, but glXCopySubBufferMESA would most definitely be useless if it didn't copy to the real frontbuffer.

--
Earthling Michel Dänzer | http://tungstengraphics.com
Libre software enthusiast | Debian, X and DRI developer
Re: DRI2 and lock-less operation
Michel Dänzer wrote:

On Wed, 2007-11-28 at 09:30 +, Keith Whitwell wrote:

Kristian Høgsberg wrote:

Another problem is that it's not just swapbuffer - anything that touches the front buffer has to respect the cliprects - glCopyPixels and glXCopySubBufferMESA - and thus has to happen in the kernel.

These don't touch the real swapbuffer, just the fake one. Hence they don't care about cliprects and certainly don't have to happen in the kernel...

I'm not sure about glCopyPixels, but glXCopySubBufferMESA would most definitely be useless if it didn't copy to the real frontbuffer.

Yes, wasn't paying attention... glXCopySubBufferMESA would do both - copy to the fake front buffer and then trigger a damage-induced update of the real frontbuffer. Neither operation requires the 3d driver to know about cliprects, and the damage operation is basically a generalization of the swapbuffer stuff we've been talking about.

Keith
Swapbuffers [was: Re: DRI2 and lock-less operation]
Kristian Høgsberg wrote:

On Nov 27, 2007 11:48 AM, Stephane Marchesin [EMAIL PROTECTED] wrote:

On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote:

... It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl, and the X server now can push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is: is that feasible?

How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM?

The idea was that the buffer swap happens in the kernel, triggered by an ioctl. The kernel generates the command stream to execute the swap against the current set of cliprects. The back buffers are always private, so the cliprects only come into play when copying from the back buffer to the front buffer. Single-buffered visuals are secretly double-buffered and implemented the same way.

I'm trying to figure out now whether it makes more sense to keep cliprects and swapbuffer out of the kernel, which wouldn't change the above much, except the swapbuffer case. I described the idea for swapbuffer in this case in my reply to Thomas: the X server publishes cliprects to the clients through a shared ring buffer, and clients parse the cliprect changes out of this buffer as they need them. When posting a swap buffer request, the buffer head should be included in the super-ioctl so that the kernel can reject stale requests. When that happens, the client must parse the new cliprect info and resubmit an updated swap buffer request.

In my ideal world, the entity which knows and cares about cliprects should be the one that does the swapbuffers, or at least is in control of the process. That entity is the X server.
Instead of tying ourselves into knots trying to figure out how to get some other entity a sufficiently up-to-date set of cliprects to make this work (which is what was wrong with DRI 1.0), maybe we should try and figure out how to get the X server to efficiently orchestrate swapbuffers. In particular it seems like we have:

1) The X server knows about cliprects.
2) The kernel knows about IRQ reception.
3) The kernel knows how to submit rendering commands to hardware.
4) Userspace is where we want to craft rendering commands.

Given the above, what do we think about swapbuffers:

a) Swapbuffers is a rendering command
b) which depends on cliprect information
c) that needs to be fired as soon as possible after an IRQ receipt.

So:

swapbuffers should be crafted from userspace (a, 4)
... by the X server (b, 1)
... and should be actually fired by the kernel (c, 2, 3)

I propose something like:

0) 3D client submits rendering to the kernel and receives back a fence.
1) 3D client wants to do swapbuffers. It sends a message to the X server asking it: please do me a swapbuffers after this fence has completed.
2) X server crafts (somehow) commands implementing swapbuffers for this drawable under the current set of cliprects and passes them to the kernel along with the fence.
3) The kernel keeps that batchbuffer to the side until
   a) the commands associated with the fence have been submitted to hardware.
   b) the next vblank IRQ arrives.

When both of these are true, the kernel simply submits the prepared swapbuffer commands through the lowest-latency path to hardware.

But what happens if the cliprects change? The 100% perfect solution looks like: The X server knows all about cliprect changes, and can use fences or other mechanisms to keep track of which swapbuffers are outstanding. At the time of a cliprect change, it must create new swapbuffer command sets for all pending swapbuffers and re-submit those to the kernel.
These new sets of commands must be tied to the progress of the X server's own rendering command stream, so that the kernel fires the appropriate one to land the swapbuffers to the correct destination as the X server's own rendering flies by. In the simplest case, where the kernel puts commands onto the one true ring as it receives them, the kernel can simply discard the old swapbuffer command. Indeed this is true also if the kernel has a ring-per-context and uses one of those rings to serialize the X server rendering and swapbuffers commands.

Note that condition 3a) above is always true in the current i915.o one-true-ring/single-fifo approach to hardware serialization.

I think the above can work and seems more straightforward than many of the proposed alternatives.

Keith
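The kernel-side "keep it to the side" logic from step 3 can be sketched as a small state machine. This is a speculative illustration, not actual DRM code: `pending_swap`, `swap_state`, `try_fire_swap`, and the cliprect generation counter are all invented names, with the generation check standing in for however the X server's resubmission after a cliprect change would supersede a stale command set:

```c
#include <assert.h>

/* A prepared swapbuffer command set, held aside by the kernel. */
struct pending_swap {
    int fence_seq;     /* rendering this swap depends on */
    int cliprect_gen;  /* cliprect generation it was built against */
    int fired;
    int discarded;
};

struct swap_state {
    int completed_seq;  /* highest fence sequence retired by hardware */
    int vblank_pending; /* a vblank IRQ has arrived */
    int cliprect_gen;   /* current cliprect generation */
};

/* Called on fence retirement and on vblank IRQ. Fires the prepared swap
 * only when its fence has completed AND a vblank has arrived; a command
 * set built against stale cliprects is simply discarded, because a
 * resubmitted version exists. Returns 1 if the swap was submitted. */
static int try_fire_swap(struct swap_state *st, struct pending_swap *sw)
{
    if (sw->fired || sw->discarded)
        return 0;
    if (sw->cliprect_gen != st->cliprect_gen) {
        sw->discarded = 1;  /* superseded by a resubmitted command set */
        return 0;
    }
    if (st->completed_seq >= sw->fence_seq && st->vblank_pending) {
        st->vblank_pending = 0;
        sw->fired = 1;      /* lowest-latency path to hardware */
        return 1;
    }
    return 0;
}
```

Note how discarding only drops the stale *version* of the command set, not the swap itself -- every swap is still performed, matching the clarification later in the thread.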
Re: DRI2 and lock-less operation
On 11/28/07, Keith Packard [EMAIL PROTECTED] wrote:

On Wed, 2007-11-28 at 00:52 +0100, Stephane Marchesin wrote:

The third case obviously will never require any kind of state re-emitting.

Unless you run out of hardware contexts.

Well, in that case we (plan to) bang the state registers from the kernel directly and do a manual state swap. So we still don't need state re-emitting.

Stephane
Re: DRI2 and lock-less operation
On Tue, 2007-11-27 at 16:50 -0500, Kristian Høgsberg wrote:

[...] I'm starting to doubt the cliprects and swapbuffer in the kernel design anyhow. I wasn't going this route originally, but since we already had that in place for i915 vblank buffer swaps, I figured we might as well go that route. I'm not sure the buffer swap from the vblank tasklet is really necessary to begin with... are there benchmarks showing that the irq-to-userspace latency was too big for this to work in userspace?

Yes, I implemented this on an i945 system where waiting for vertical blank and then submitting the buffer swap blit from userspace resulted in constant tearing.

--
Earthling Michel Dänzer | http://tungstengraphics.com
Libre software enthusiast | Debian, X and DRI developer
Re: DRI2 and lock-less operation
On Tue, 2007-11-27 at 15:44 -0500, Kristian Høgsberg wrote:

On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote:

... In both cases, the kernel will need to know how to activate a given context, and the context handle should be part of the super-ioctl arguments.

We needn't expose the contexts to user space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server, where a context per client may be useful).

Oh right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd?

That's not supported, every GLX context has its own DRM context.

--
Earthling Michel Dänzer | http://tungstengraphics.com
Libre software enthusiast | Debian, X and DRI developer
Re: Swapbuffers [was: Re: DRI2 and lock-less operation]
On 11/28/07, Keith Whitwell [EMAIL PROTECTED] wrote:

In my ideal world, the entity which knows and cares about cliprects should be the one that does the swapbuffers, or at least is in control of the process. That entity is the X server. Instead of tying ourselves into knots trying to figure out how to get some other entity a sufficiently up-to-date set of cliprects to make this work (which is what was wrong with DRI 1.0), maybe we should try and figure out how to get the X server to efficiently orchestrate swapbuffers. In particular it seems like we have:

1) The X server knows about cliprects.
2) The kernel knows about IRQ reception.
3) The kernel knows how to submit rendering commands to hardware.
4) Userspace is where we want to craft rendering commands.

Given the above, what do we think about swapbuffers:

a) Swapbuffers is a rendering command
b) which depends on cliprect information
c) that needs to be fired as soon as possible after an IRQ receipt.

So:

swapbuffers should be crafted from userspace (a, 4)
... by the X server (b, 1)
... and should be actually fired by the kernel (c, 2, 3)

Well, on nvidia hw, you don't even need to fire from the kernel (thanks to a special fifo command that waits for vsync). So I'd love it if going through the kernel for swapbuffers was abstracted by the interface.

I propose something like:

0) 3D client submits rendering to the kernel and receives back a fence.
1) 3D client wants to do swapbuffers. It sends a message to the X server asking it: please do me a swapbuffers after this fence has completed.
2) X server crafts (somehow) commands implementing swapbuffers for this drawable under the current set of cliprects and passes them to the kernel along with the fence.
3) The kernel keeps that batchbuffer to the side until
   a) the commands associated with the fence have been submitted to hardware.
   b) the next vblank IRQ arrives.
When both of these are true, the kernel simply submits the prepared swapbuffer commands through the lowest-latency path to hardware. But what happens if the cliprects change? The 100% perfect solution looks like: The X server knows all about cliprect changes, and can use fences or other mechanisms to keep track of which swapbuffers are outstanding. At the time of a cliprect change, it must create new swapbuffer command sets for all pending swapbuffers and re-submit those to the kernel. These new sets of commands must be tied to the progress of the X server's own rendering command stream, so that the kernel fires the appropriate one to land the swapbuffers to the correct destination as the X server's own rendering flies by.

Yes, that was the basis for my thinking as well. By inserting the swapbuffers into the normal flow of X commands, we remove the need for syncing with the X server at swapbuffer time.

In the simplest case, where the kernel puts commands onto the one true ring as it receives them, the kernel can simply discard the old swapbuffer command. Indeed this is true also if the kernel has a ring-per-context and uses one of those rings to serialize the X server rendering and swapbuffers commands.

Come on, admit that's a hack to get 100'000 fps in glxgears :)

Note that condition 3a) above is always true in the current i915.o one-true-ring/single-fifo approach to hardware serialization.

Yes, I think those details of how to wait should be left driver-dependent and abstracted in user space, so that we have the choice of calling the kernel, doing it from user space only, relying on a single fifo or whatever.

I think the above can work and seems more straightforward than many of the proposed alternatives.

This is what I want to do too. Especially since in the nvidia case we don't have the issue of routing vblank interrupts to user space for that. So, the only issue I'm worried about is the latency induced by this approach.
When the DRM does the swaps, you can ensure it'll get executed pretty fast. If X has been stuffing piles of commands into its command buffer, it might not be so fast. What this means is that 3D might be slowed down by 2D rendering (think especially of the case of EXA fallbacks, which will sync your fifo). In that case, ensuring a no-fallback EXA would become relevant to achieving smooth 3D performance. But at least it solves the issue of sluggish OpenGL window moves and resizes (/me looks at the nvidia binary driver behaviour).

Stephane
Re: Swapbuffers [was: Re: DRI2 and lock-less operation]
Stephane Marchesin wrote:

On 11/28/07, Keith Whitwell [EMAIL PROTECTED] wrote:

In my ideal world, the entity which knows and cares about cliprects should be the one that does the swapbuffers, or at least is in control of the process. That entity is the X server. Instead of tying ourselves into knots trying to figure out how to get some other entity a sufficiently up-to-date set of cliprects to make this work (which is what was wrong with DRI 1.0), maybe we should try and figure out how to get the X server to efficiently orchestrate swapbuffers. In particular it seems like we have:

1) The X server knows about cliprects.
2) The kernel knows about IRQ reception.
3) The kernel knows how to submit rendering commands to hardware.
4) Userspace is where we want to craft rendering commands.

Given the above, what do we think about swapbuffers:

a) Swapbuffers is a rendering command
b) which depends on cliprect information
c) that needs to be fired as soon as possible after an IRQ receipt.

So:

swapbuffers should be crafted from userspace (a, 4)
... by the X server (b, 1)
... and should be actually fired by the kernel (c, 2, 3)

Well, on nvidia hw, you don't even need to fire from the kernel (thanks to a special fifo command that waits for vsync). So I'd love it if going through the kernel for swapbuffers was abstracted by the interface.

As I suggested elsewhere, I think that you're probably going to need this even on nvidia hardware.

I propose something like:

0) 3D client submits rendering to the kernel and receives back a fence.
1) 3D client wants to do swapbuffers. It sends a message to the X server asking it: please do me a swapbuffers after this fence has completed.
2) X server crafts (somehow) commands implementing swapbuffers for this drawable under the current set of cliprects and passes them to the kernel along with the fence.
3) The kernel keeps that batchbuffer to the side until
   a) the commands associated with the fence have been submitted to hardware.
   b) the next vblank IRQ arrives.

When both of these are true, the kernel simply submits the prepared swapbuffer commands through the lowest-latency path to hardware. But what happens if the cliprects change? The 100% perfect solution looks like: The X server knows all about cliprect changes, and can use fences or other mechanisms to keep track of which swapbuffers are outstanding. At the time of a cliprect change, it must create new swapbuffer command sets for all pending swapbuffers and re-submit those to the kernel. These new sets of commands must be tied to the progress of the X server's own rendering command stream, so that the kernel fires the appropriate one to land the swapbuffers to the correct destination as the X server's own rendering flies by.

Yes, that was the basis for my thinking as well. By inserting the swapbuffers into the normal flow of X commands, we remove the need for syncing with the X server at swapbuffer time.

The very simplest way would be just to have the X server queue it up like normal blits and not even involve the kernel. I'm not proposing this. I believe such an approach will fail for the sync-to-vblank case due to latency issues - even (I suspect) with hardware wait-for-vblank. Rather, I'm describing a mechanism that allows a pre-prepared swapbuffer command to be injected into the X command stream (one way or another) with the guarantee that it will encode the correct cliprects, but which will avoid stalling the command queue in the meantime.

In the simplest case, where the kernel puts commands onto the one true ring as it receives them, the kernel can simply discard the old swapbuffer command. Indeed this is true also if the kernel has a ring-per-context and uses one of those rings to serialize the X server rendering and swapbuffers commands.
Come on, admit that's a hack to get 100'000 fps in glxgears :)

I'm not talking about discarding the whole swap operation, just the version of the swap command buffer that pertained to the old cliprects. Every swap is still being performed. You do raise a good point though -- we currently throttle the 3d driver based on swapbuffer fences. There would need to be some equivalent mechanism to achieve this.

Note that condition 3a) above is always true in the current i915.o one-true-ring/single-fifo approach to hardware serialization.

Yes, I think those details of how to wait should be left driver-dependent and abstracted in user space. So that we have the choice of calling the kernel, doing it from user space only, relying on a single fifo or
Re: Swapbuffers [was: Re: DRI2 and lock-less operation]
On Wed, Nov 28, 2007 at 12:43:18PM +0100, Stephane Marchesin wrote:

This is what I want to do too. Especially since in the nvidia case we don't have the issue of routing vblank interrupts to user space for that. So, the only issue I'm worried about is the latency induced by this approach. When the DRM does the swaps, you can ensure it'll get executed pretty fast. If X has been stuffing piles of commands into its command buffer, it might not be so fast. What this means is that 3D might be slowed down by 2D rendering (think especially of the case of EXA fallbacks, which will sync your fifo). In that case, ensuring a no-fallback EXA would become relevant to achieving smooth 3D performance. But at least it solves the issue of sluggish OpenGL window moves and resizes (/me looks at the nvidia binary driver behaviour).

Stephane

I likely had a problem with my mail, as I think my previous mail didn't get through. Anyway, I am all for having the X server responsible for swapping buffers. To solve part of the above problem we might have two contexts (fifos) for the X server: one for X drawing, one for X buffer swaps. The swap context (fifo) is a top-priority thing and should preempt the other contexts (fifos). An outcome of this is that we might want a simple GPU scheduler for such cases (and maybe others in the future), but obviously such a scheduler will be highly hardware dependent.

Cheers,
Jerome Glisse
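Jerome's two-fifo idea can be sketched as a toy scheduler. Everything here is hypothetical (`struct fifo`, `schedule_next` are invented names), and as he notes, a real implementation would be highly hardware dependent; the only point illustrated is the policy that the swap fifo always preempts the drawing fifo:

```c
#include <assert.h>

/* A trivial command fifo; no wraparound, sized for illustration only. */
#define FIFO_DEPTH 8

struct fifo { int cmds[FIFO_DEPTH]; int head, tail; };

static int  fifo_empty(const struct fifo *f) { return f->head == f->tail; }
static void fifo_push(struct fifo *f, int cmd) { f->cmds[f->tail++] = cmd; }
static int  fifo_pop(struct fifo *f)           { return f->cmds[f->head++]; }

/* Strict-priority pick: drain the swap fifo before the drawing fifo.
 * Returns 1 and stores the command in *cmd, or 0 if both are empty. */
static int schedule_next(struct fifo *swap, struct fifo *draw, int *cmd)
{
    if (!fifo_empty(swap)) { *cmd = fifo_pop(swap); return 1; }
    if (!fifo_empty(draw)) { *cmd = fifo_pop(draw); return 1; }
    return 0;
}
```

With this policy, a pending buffer swap is never stuck behind a pile of queued 2D rendering, which is exactly the latency concern raised above.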
Re: DRI2 and lock-less operation
Michel Dänzer wrote:

On Tue, 2007-11-27 at 15:44 -0500, Kristian Høgsberg wrote:

On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote:

... In both cases, the kernel will need to know how to activate a given context, and the context handle should be part of the super-ioctl arguments.

We needn't expose the contexts to user space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server, where a context per client may be useful).

Oh right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd?

That's not supported, every GLX context has its own DRM context.

It needs to be. DRI applications should have a single fd open to the drm, otherwise shared buffer operation will be problematic.

/Thomas
Re: DRI2 and lock-less operation
On Nov 26, 2007 6:51 PM, Jerome Glisse [EMAIL PROTECTED] wrote:

... Sounds like this will all work out after all :) Kristian

Yes. Meanwhile I started looking at your dri2 branch, but didn't find a dri2 branch in your intel ddx repository - did you push the code anywhere else? It would help me to see what I need to do for dri2 in the ddx.

I didn't push that yet, but I've attached the patch as it looks right now. It's still very much work-in-progress.

Kristian

commit 3a16f83c278902ff673bc21272a3b95eb6237dab
Author: Kristian Høgsberg [EMAIL PROTECTED]
Date:   Mon Nov 19 17:31:02 2007 -0500

    dri2 wip

diff --git a/src/i830.h b/src/i830.h
index 57f0544..ba20df8 100644
--- a/src/i830.h
+++ b/src/i830.h
@@ -295,6 +295,12 @@ enum last_3d {
     LAST_3D_ROTATION
 };

+enum dri_type {
+    DRI_TYPE_NONE,
+    DRI_TYPE_XF86DRI,
+    DRI_TYPE_DRI2
+};
+
 typedef struct _I830Rec {
     unsigned char *MMIOBase;
     unsigned char *GTTBase;
@@ -456,7 +462,7 @@ typedef struct _I830Rec {
     CARD32 samplerstate[6];

     Bool directRenderingDisabled;	/* DRI disabled in PreInit. */
-    Bool directRenderingEnabled;	/* DRI enabled this generation. */
+    enum dri_type directRendering;	/* Type of DRI enabled this generation. */

 #ifdef XF86DRI
     Bool directRenderingOpen;
diff --git a/src/i830_accel.c b/src/i830_accel.c
index 4d9ea79..e38897e 100644
--- a/src/i830_accel.c
+++ b/src/i830_accel.c
@@ -136,7 +136,7 @@ I830WaitLpRing(ScrnInfoPtr pScrn, int n, int timeout_millis)
 	i830_dump_error_state(pScrn);
 	ErrorF("space: %d wanted %d\n", ring->space, n);
 #ifdef XF86DRI
-	if (pI830->directRenderingEnabled) {
+	if (pI830->directRendering == DRI_TYPE_XF86DRI) {
 	    DRIUnlock(screenInfo.screens[pScrn->scrnIndex]);
 	    DRICloseScreen(screenInfo.screens[pScrn->scrnIndex]);
 	}
@@ -176,7 +176,7 @@ I830Sync(ScrnInfoPtr pScrn)
 #ifdef XF86DRI
     /* VT switching tries to do this. */
-    if (!pI830->LockHeld && pI830->directRenderingEnabled) {
+    if (!pI830->LockHeld && pI830->directRendering != DRI_TYPE_NONE) {
 	return;
     }
 #endif
diff --git a/src/i830_display.c b/src/i830_display.c
index d988b86..8b07c51 100644
--- a/src/i830_display.c
+++ b/src/i830_display.c
@@ -408,7 +408,7 @@ i830PipeSetBase(xf86CrtcPtr crtc, int x, int y)
     }
 #ifdef XF86DRI
-    if (pI830->directRenderingEnabled) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI) {
 	drmI830Sarea *sPriv = (drmI830Sarea *) DRIGetSAREAPrivate(pScrn->pScreen);
 	if (!sPriv)
@@ -755,7 +755,7 @@ i830_crtc_dpms(xf86CrtcPtr crtc, int mode)
     intel_crtc->dpms_mode = mode;
 #ifdef XF86DRI
-    if (pI830->directRenderingEnabled) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI) {
 	drmI830Sarea *sPriv = (drmI830Sarea *) DRIGetSAREAPrivate(pScrn->pScreen);
 	Bool enabled = crtc->enabled && mode != DPMSModeOff;
diff --git a/src/i830_dri.c b/src/i830_dri.c
index 4d3458f..511b183 100644
--- a/src/i830_dri.c
+++ b/src/i830_dri.c
@@ -1236,6 +1236,10 @@ I830DRIInitBuffers(WindowPtr pWin, RegionPtr prgn, CARD32 index)
     i830MarkSync(pScrn);
 }

+/* FIXME: fix this for real */
+#define ALLOCATE_LOCAL(size) xalloc(size)
+#define DEALLOCATE_LOCAL(ptr) xfree(ptr)
+
 /* This routine is a modified form of XAADoBitBlt with the calls to
  * ScreenToScreenBitBlt built in. My routine has the prgnSrc as source
  * instead of destination. My origin is upside down so the ydir cases
@@ -1748,7 +1752,7 @@ I830DRISetVBlankInterrupt (ScrnInfoPtr pScrn, Bool on)
     if (!pI830->want_vblank_interrupts)
 	on = FALSE;
-    if (pI830->directRenderingEnabled && pI830->drmMinor >= 5) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI && pI830->drmMinor >= 5) {
 	if (on) {
 	    if (xf86_config->num_crtc > 1 && xf86_config->crtc[1]->enabled)
 		if (pI830->drmMinor >= 6)
@@ -1775,7 +1779,7 @@ I830DRILock(ScrnInfoPtr pScrn)
 {
     I830Ptr pI830 = I830PTR(pScrn);
-    if (pI830->directRenderingEnabled && !pI830->LockHeld) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI && !pI830->LockHeld) {
 	DRILock(screenInfo.screens[pScrn->scrnIndex], 0);
 	pI830->LockHeld = 1;
 	I830RefreshRing(pScrn);
@@ -1792,7 +1796,7 @@ I830DRIUnlock(ScrnInfoPtr pScrn)
 {
     I830Ptr pI830 = I830PTR(pScrn);
-    if (pI830->directRenderingEnabled && pI830->LockHeld) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI && pI830->LockHeld) {
 	DRIUnlock(screenInfo.screens[pScrn->scrnIndex]);
 	pI830->LockHeld = 0;
     }
diff --git a/src/i830_driver.c b/src/i830_driver.c
index eacaefc..08a5904 100644
--- a/src/i830_driver.c
+++ b/src/i830_driver.c
@@ -208,6 +208,10 @@ USE OR OTHER DEALINGS IN THE SOFTWARE.
 #endif
 #endif

+#ifdef DRI2
+#include "dri2.h"
+#endif
+
 #ifdef I830_USE_EXA
 const char *I830exaSymbols[] = {
     exaGetVersion,
@@ -1467,7 +1471,7 @@ I830PreInit(ScrnInfoPtr pScrn, int flags)
     pI830->directRenderingDisabled =
 	!xf86ReturnOptValBool(pI830->Options, OPTION_DRI, TRUE);
-#ifdef XF86DRI
+#if defined(XF86DRI) || defined(DRI2)
     if (!pI830->directRenderingDisabled) {
 	if (pI830->noAccel
Re: DRI2 and lock-less operation
On Wed, 2007-11-28 at 15:07 +0100, Thomas Hellström wrote: Michel Dänzer wrote: On Tue, 2007-11-27 at 15:44 -0500, Kristian Høgsberg wrote: On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote: ... In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful). Oh, right we don't need one per GLContext, just per DRI client, mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd? That's not supported, every GLX context has its own DRM context. It needs to be. DRI applications should have a single fd open to drm, otherwise shared buffer operation will be problematic. Yeah, I misremembered and thought there was a 1:1 correspondence between DRM contexts and fds. Sorry for the confusion. -- Earthling Michel Dänzer | http://tungstengraphics.com Libre software enthusiast | Debian, X and DRI developer - SF.Net email is sponsored by: The Future of Linux Business White Paper from Novell. From the desktop to the data center, Linux is going mainstream. Let it simplify your IT future. http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4 -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: DRI2 and lock-less operation
On Nov 28, 2007 6:15 AM, Stephane Marchesin [EMAIL PROTECTED] wrote: On 11/28/07, Keith Packard [EMAIL PROTECTED] wrote: On Wed, 2007-11-28 at 00:52 +0100, Stephane Marchesin wrote: The third case obviously will never require any kind of state re-emitting. Unless you run out of hardware contexts. Well, in that case we (plan to) bang the state registers from the kernel directly and do a manual state swap. So we still don't need state re-emitting. Well, banging the state registers from the kernel sounds like state emission to me - you mean userspace state emission won't be needed? How does the kernel know how to restore the state? Can you swap a previously set hardware state out to vram/hostmem? I guess you can always just read out the registers you need... Kristian
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote: On Nov 22, 2007 4:03 AM, Thomas Hellström [EMAIL PROTECTED] wrote: ... There are probably various ways to do this, which is another argument for keeping super-ioctls device-specific. For i915-type hardware, Dave's approach above is probably the most attractive one. For Poulsbo, all state is always implicitly included, usually as a reference to a buffer object, so we don't really bother about contexts here. For hardware like the Unichrome, the state is stored in a limited set of registers. Here the drm can keep a copy of those registers for each context and do a smart update on a context switch. However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible. The idea is to keep the cliprects in the kernel and only render double buffered. The only code that needs to worry about cliprects is swap buffer, either immediate or synced to vblank. What are the cliprect problems you mention? Kristian Hi, Kristian. Sorry for the late response. What I'm thinking about is the case where the swapbuffers needs to be done by the 3D engine, and with increasingly complex hardware this will at the very least mean some sort of pixel-shader code in the kernel, or perhaps loaded by the kernel firmware interface as a firmware snippet. If we take Poulsbo as an example, the 2D engine is present and open, and swapbuffers can be done using it, but since the 2D and 3D engines operate separately they are synced in software by the TTM memory manager fence class arbitration code, and we lose performance since we cannot pipeline 3D tasks as we'd want to. If the 3D engine were open, we'd still need a vast amount of code in the kernel to be able to just do a 3D blit. This is even more complicated by the fact that we may not be able to implement 3D blits in the kernel for IP protection reasons. Note that I'm just stating the problem here. I'm not arguing that this should influence the DRI2 design.
/Thomas
Re: DRI2 and lock-less operation
On 11/27/07, Thomas Hellström [EMAIL PROTECTED] wrote: Kristian Høgsberg wrote: On Nov 22, 2007 4:03 AM, Thomas Hellström [EMAIL PROTECTED] wrote: ... There are probably various ways to do this, which is another argument for keeping super-ioctls device-specific. For i915-type hardware, Dave's approach above is probably the most attractive one. For Poulsbo, all state is always implicitly included, usually as a reference to a buffer object, so we don't really bother about contexts here. For hardware like the Unichrome, the state is stored in a limited set of registers. Here the drm can keep a copy of those registers for each context and do a smart update on a context switch. However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible. The idea is to keep the cliprects in the kernel and only render double buffered. The only code that needs to worry about cliprects is swap buffer, either immediate or synced to vblank. What are the cliprect problems you mention? Kristian Hi, Kristian. Sorry for the late response. What I'm thinking about is the case where the swapbuffers needs to be done by the 3D engine, and with increasingly complex hardware this will at the very least mean some sort of pixel-shader code in the kernel, or perhaps loaded by the kernel firmware interface as a firmware snippet. If we take Poulsbo as an example, the 2D engine is present and open, and swapbuffers can be done using it, but since the 2D and 3D engines operate separately they are synced in software by the TTM memory manager fence class arbitration code, and we lose performance since we cannot pipeline 3D tasks as we'd want to. If the 3D engine were open, we'd still need a vast amount of code in the kernel to be able to just do a 3D blit. Then why don't you do it in user space? You could very well do swapbuffers in the DDX (and cliprects then become a non-issue).
This is even more complicated by the fact that we may not be able to implement 3D blits in the kernel for IP protection reasons. Note that I'm just stating the problem here. I'm not arguing that this should influence the DRI2 design. Yes, I don't think we should base the open source DRI design upon specs that have to be kept hidden. Especially if that hardware is not relevant in any way technically. Stephane
Re: DRI2 and lock-less operation
On 11/27/07, Keith Packard [EMAIL PROTECTED] wrote: On Mon, 2007-11-26 at 17:15 -0500, Kristian Høgsberg wrote: -full state I assume you'll deal with hardware which supports multiple contexts and avoid the need to include full state with each buffer? That's what we do already for nouveau, and there are no issues implementing it. Really that's driver-dependent, like the radeon atom emission mechanism is. Stephane
Re: DRI2 and lock-less operation
On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: Hi all, I've been working on the DRI2 implementation recently and I'm starting to get a little confused, so I figured I'd throw a couple of questions out to the list. First off, I wrote up this summary shortly after XDS http://wiki.x.org/wiki/DRI2 which upon re-reading is still pretty much up to date with what I'm trying to do. The buzzword summary from the page has
* Lockless
* Always private back buffers
* No clip rects in DRI driver
* No DDX driver part
* Minimal X server part
* Swap buffer and clip rects in kernel
* No SAREA
I've implemented the DRI2 xserver module (http://cgit.freedesktop.org/~krh/xserver/log/?h=dri2) and the new drm ioctls that it uses (http://cgit.freedesktop.org/~krh/drm/log/?h=dri2). I did the DDX part for the intel driver and DRI2 initialization consists of doing drmOpen (this is now up to the DDX driver), initializing the memory manager and using it to allocate stuff, and then calling DRI2ScreenInit(), passing in pScreen and the file descriptor. Basically, all of i830_dri.c isn't used in this mode. It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl and the X server now can push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is, is that feasible? How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM? I'm trying to figure out how context switches actually work... the DRI lock is overloaded as context switcher, and there is code in the kernel to call out to a chipset specific context switch routine when the DRI lock is taken... but only ffb uses it...
So I'm guessing the way context switches work today is that the DRI driver grabs the lock and after potentially updating the cliprects through X protocol, it emits all the state it depends on to the card. Is the state emission done by just writing out a bunch of registers? Is this how the X server works too, except XAA/EXA acceleration doesn't depend on a lot of state and thus the DDX driver can emit everything for each operation? How would this work if we didn't have a lock? You can't emit the state and then start rendering without a lock to keep the state in place... If the kernel doesn't restore any state, what's the point of the drm_context_t we pass to the kernel in drmLock? Should the kernel know how to restore state (this ties in to the email from jglisse on state tracking in drm and all the gallium jazz, I guess)? How do we identify state to the kernel, and how do we pass it in in the super-ioctl? Can we add a list of registers to be written and the values? I talked to Dave about it and we agreed that adding a drm_context_t to the super-ioctl would work, but now I'm thinking if the kernel doesn't track any state it won't really work. Maybe I seem to recall Eric Anholt did optimize the emission of radeon state atoms a long time ago, and he got some speed improvements. You'd have to ask him how much though. This could give us a rough idea of the performance impact of emitting full state vs needed state, although this doesn't measure the gain of removing lock contention. I might be totally wrong on this, this dates back to a couple of years ago. cross-client state sharing isn't useful for performance as Keith and Roland argue, but if the kernel doesn't restore state when it sees a super-ioctl coming from a different context, who does? Yes it probably has to go to the kernel anyway. Stephane
Re: DRI2 and lock-less operation
On Nov 26, 2007 11:15 PM, Keith Packard [EMAIL PROTECTED] wrote: On Mon, 2007-11-26 at 17:15 -0500, Kristian Høgsberg wrote: -full state I assume you'll deal with hardware which supports multiple contexts and avoid the need to include full state with each buffer? As for hardware contexts, I guess there are two cases; hardware that has a fixed set of contexts builtin and hardware where a context is just a piece of video memory that you can point to (effectively an unlimited number of contexts). In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. In the case of unlimited contexts, that may be all that's needed. In the case of a fixed number of contexts, you will need to be able to restore state when you have more software contexts in use than you have hardware contexts. I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware... But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort. cheers, Kristian
Re: DRI2 and lock-less operation
On Tue, 2007-11-27 at 14:03 -0500, Kristian Høgsberg wrote: As for hardware contexts, I guess there are two cases; hardware that has a fixed set of contexts builtin and hardware where a context is just a piece of video memory that you can point to (effectively an unlimited number of contexts). Yeah, intel hardware uses memory and has no intrinsic limit to the number of contexts; they just need to be pinned in GTT space and given a unique address. In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful). I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware... I recall having the same discussion in the past; having the superioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware. All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'. But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort.
We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that. -- [EMAIL PROTECTED]
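The flag-and-error contract Keith describes above could be sketched as follows. This is a userspace simulation only, not real DRM code; every name in it (fake_superioctl, EXEC_FLAG_REINIT_STATE, the choice of ENXIO as the "lost context" errno) is hypothetical:

```c
#include <errno.h>

/* Hypothetical flag: "this submission reinitializes the hardware". */
#define EXEC_FLAG_REINIT_STATE 0x1
/* Hypothetical "lost context" error; ENXIO is an arbitrary stand-in. */
#define ELOSTCTX ENXIO

struct fake_superioctl_args {
    unsigned int context; /* per-client context handle */
    unsigned int flags;   /* EXEC_FLAG_REINIT_STATE, ... */
    /* batchbuffer handle, relocations, etc. would follow */
};

static unsigned int hw_current_context; /* context last active on the GPU */

/* Kernel-side decision: if the submitting context is not the one currently
 * on the hardware and the client did not flag a state-reinit preamble,
 * fail so the client can resubmit with one. */
int fake_superioctl(struct fake_superioctl_args *args)
{
    if (args->context != hw_current_context &&
        !(args->flags & EXEC_FLAG_REINIT_STATE))
        return -ELOSTCTX;

    hw_current_context = args->context; /* this submission owns the hw now */
    return 0;
}
```

The point of the sketch is only the control flow: the kernel never restores state itself, it merely tells the client when state must be re-established.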
Re: DRI2 and lock-less operation
On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote: ... In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful). Oh, right we don't need one per GLContext, just per DRI client, mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd? I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware... I recall having the same discussion in the past; having the superioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware. All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'. Exactly. But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort. We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support.
The drivers already have the code to emit state in case of context loss, that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work? Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that. Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported? Kristian
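The resubmit loop Kristian sketches above (submit, and on "lost context" prepend a state-emission preamble and try again) could look roughly like this. It is a self-contained simulation under invented names (fake_submit, emit_full_state_preamble); the real code would call the chipset's superioctl:

```c
#include <errno.h>
#include <stddef.h>

/* Simulated kernel: fails a preamble-less submission while the context is
 * lost, which is the hypothetical "lost context" error from the thread. */
static int context_lost = 1;

static int fake_submit(const char *preamble, const char *batch)
{
    (void)batch;
    if (context_lost && preamble == NULL)
        return -ENXIO; /* stand-in "lost context" error */
    context_lost = 0;  /* state is now (re)established on the hardware */
    return 0;
}

/* Placeholder for the context-loss state emission code the drivers already
 * have, repurposed to fill a batch buffer instead of the ring. */
static const char *emit_full_state_preamble(void)
{
    return "STATE";
}

/* Client-side loop: the rendering batch itself never changes; only the
 * preamble is generated and prepended when the first attempt fails. */
int submit_rendering(const char *batch)
{
    int ret = fake_submit(NULL, batch);
    if (ret == -ENXIO)
        ret = fake_submit(emit_full_state_preamble(), batch);
    return ret;
}
```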
Re: DRI2 and lock-less operation
On Nov 27, 2007 10:41 AM, Thomas Hellström [EMAIL PROTECTED] wrote: ... However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible. The idea is to keep the cliprects in the kernel and only render double buffered. The only code that needs to worry about cliprects is swap buffer, either immediate or synced to vblank. What are the cliprect problems you mention? Kristian Hi, Kristian. Sorry for the late response. What I'm thinking about is the case where the swapbuffers needs to be done by the 3D engine, and with increasingly complex hardware this will at the very least mean some sort of pixel-shader code in the kernel, or perhaps loaded by the kernel firmware interface as a firmware snippet. And the kernel will have to somehow communicate the list of clip rects to this firmware/pixel-shader in one way or another, maybe fixing up the code or providing a tri-list buffer or something. I don't really know, but it already sounds too complicated for the kernel, in my opinion. Another problem is that it's not just swapbuffer - anything that touches the front buffer has to respect the cliprects - glCopyPixels and glXCopySubBufferMESA - and thus has to happen in the kernel. If we take Poulsbo as an example, the 2D engine is present and open, and swapbuffers can be done using it, but since the 2D and 3D engines operate separately they are synced in software by the TTM memory manager fence class arbitration code, and we lose performance since we cannot pipeline 3D tasks as we'd want to. If the 3D engine were open, we'd still need a vast amount of code in the kernel to be able to just do a 3D blit. This is even more complicated by the fact that we may not be able to implement 3D blits in the kernel for IP protection reasons. Note that I'm just stating the problem here. I'm not arguing that this should influence the DRI2 design.
I understand, but I'm starting to doubt the cliprects and swapbuffer in the kernel design anyhow. I wasn't going this route originally, but since we already had that in place for i915 vblank buffer swaps, I figured we might as well go that route. I'm not sure the buffer swap from the vblank tasklet is really necessary to begin with... are there benchmarks showing that the irq-userspace latency was too big for this to work in userspace? My proposal back at XDS was to have a shared host memory ring buffer where the X server pushes cliprect changes and clients read it out from there, and I still think that's a nice design. In a lockless world, the superioctl arguments optionally include the buffer head (as a timestamp) so that the kernel can detect whether a swap buffer batchbuffer is stale. If it is, meaning the X server published cliprect changes, the submit fails and the client must recompute the batchbuffer for the swap command and resubmit after parsing the new cliprects. When rendering to a private back buffer, clip rects aren't relevant and so the superioctl won't have the buffer head field set and the kernel will never discard it. I dunno, maybe I'm just rambling... Kristian
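The head-as-timestamp scheme described above can be sketched in a few lines. This is an illustration with invented names (cliprect_ring, publish_cliprects, submit_swap), not any existing interface; the real ring would live in shared memory and carry the actual cliprect entries:

```c
#include <errno.h>

struct cliprect_ring {
    unsigned int head; /* bumped by the X server on every cliprect update */
    /* cliprect entries would follow in shared memory */
};

/* X server side: publish a cliprect change by advancing the head. */
void publish_cliprects(struct cliprect_ring *ring)
{
    ring->head++;
}

/* Kernel side: a swap batchbuffer built against snapshot_head is stale if
 * the server has published since; the client must reparse the ring,
 * recompute the swap batch, and resubmit. */
int submit_swap(const struct cliprect_ring *ring, unsigned int snapshot_head)
{
    if (snapshot_head != ring->head)
        return -EAGAIN; /* stale cliprects: recompute and resubmit */
    return 0;           /* cliprects unchanged, execute the swap */
}
```

Note this only detects staleness at submit time; whether the server can still race between the check and execution is exactly the question Stephane raises below in the thread.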
Re: DRI2 and lock-less operation
On Nov 27, 2007 11:48 AM, Stephane Marchesin [EMAIL PROTECTED] wrote: On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: ... It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl and the X server now can push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is, is that feasible? How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM? The idea was that the buffer swap happens in the kernel, triggered by an ioctl. The kernel generates the command stream to execute the swap against the current set of cliprects. The back buffers are always private so the cliprects only come into play when copying from the back buffer to the front buffer. Single buffered visuals are secretly double buffered and implemented the same way. I'm trying to figure out now whether it makes more sense to keep cliprects and swapbuffer out of the kernel, which wouldn't change the above much, except the swapbuffer case. I described the idea for swapbuffer in this case in my reply to Thomas: the X server publishes cliprects to the clients through a shared ring buffer, and clients parse the clip rect changes out of this buffer as they need it. When posting a swap buffer request, the buffer head should be included in the super-ioctl so that the kernel can reject stale requests. Kristian
Re: DRI2 and lock-less operation
In general the problem with the superioctl returning 'fail' is that the client has to then go back in time and figure out what the state preamble would have been at the start of the batchbuffer. Of course the easiest way to do this is to actually precompute the preamble at batchbuffer start time and store it in case the superioctl fails -- in which case, why not pass it to the kernel along with the rest of the batchbuffer and have the kernel decide whether or not to play it? Keith - Original Message From: Kristian Høgsberg [EMAIL PROTECTED] To: Keith Packard [EMAIL PROTECTED] Cc: Jerome Glisse [EMAIL PROTECTED]; Dave Airlie [EMAIL PROTECTED]; dri-devel@lists.sourceforge.net; Keith Whitwell [EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 8:44:48 PM Subject: Re: DRI2 and lock-less operation On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote: ... In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful). Oh, right we don't need one per GLContext, just per DRI client, mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd? I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware... I recall having the same discussion in the past; having the superioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware.
All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'. Exactly. But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort. We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support. The drivers already have the code to emit state in case of context loss, that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work? Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that. Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported? Kristian
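Keith Whitwell's alternative at the top of this message (always precompute the preamble and hand both buffers to the kernel, which decides whether to play it) avoids the fail-and-resubmit round trip entirely. A sketch, again with invented names and a simulated kernel:

```c
#include <stddef.h>

/* Hypothetical submission carrying both the precomputed full-state
 * preamble and the rendering batch. */
struct fake_submission {
    unsigned int context;
    const char *preamble; /* precomputed at batchbuffer start time */
    const char *batch;    /* rendering commands, never recomputed */
};

static unsigned int last_context;  /* context last played on the hardware */
static int preambles_played;       /* for illustration: how often we paid */

/* Kernel side: play the client-supplied preamble only when the hardware
 * context actually changed since the last submission. */
void kernel_execute(const struct fake_submission *s)
{
    if (s->context != last_context) {
        /* context switch: replay the state preamble first */
        preambles_played++;
        last_context = s->context;
    }
    /* ... then execute s->batch ... */
}
```

The trade-off versus the fail-and-resubmit scheme is that the client always does the preamble computation, but never takes an extra ioctl round trip.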
Re: DRI2 and lock-less operation
On 11/27/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: On Nov 27, 2007 11:48 AM, Stephane Marchesin [EMAIL PROTECTED] wrote: On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: ... It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl and the X server now can push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is, is that feasible? How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM? The idea was that the buffer swap happens in the kernel, triggered by an ioctl. The kernel generates the command stream to execute the swap against the current set of cliprects. The back buffers are always private so the cliprects only come into play when copying from the back buffer to the front buffer. Single buffered visuals are secretly double buffered and implemented the same way. What if cliprects change between the time you emit them to the fifo and the time the blit gets executed by the card? Do you sync to the card in-drm, effectively killing performance? I'm trying to figure out now whether it makes more sense to keep cliprects and swapbuffer out of the kernel, which wouldn't change the above much, except the swapbuffer case. I described the idea for swapbuffer in this case in my reply to Thomas: the X server publishes cliprects to the clients through a shared ring buffer, and clients parse the clip rect changes out of this buffer as they need it. When posting a swap buffer request, the buffer head should be included in the super-ioctl so that the kernel can reject stale requests.
When that happens, the client must parse the new cliprect info and resubmit an updated swap buffer request. I fail to see how this works with a lockless design. How do you ensure the X server doesn't change cliprects between the time it has written those in the shared ring buffer and the time the DRI client picks them up and has the command fired and actually executed? Do you lock out the server during that time? Stephane
Re: DRI2 and lock-less operation
On 11/27/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported?

Hmm, you can't be that binary about it. I think there are 3 classes of devices around:
- no context support at all: old radeon, old intel
- multiple fifos, no hw context switching: newer radeon, newer intel
- multiple fifos, hw context switching: all nvidia
The third case obviously will never require any kind of state re-emitting. Stephane
Re: DRI2 and lock-less operation
On Tue, 2007-11-27 at 15:44 -0500, Kristian Høgsberg wrote: Oh, right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd?

For that, you'd either want 'switch context' ioctls, or context arguments with every request. The former is easy to retrofit, the latter would have to be done right now.

All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'. Exactly.

I think Keith's comment that knowing how to reinit is effectively the same as computing the reinitialization buffer may be relevant here. I'm not entirely sure this is true; it might be possible to save higher level information about the state than the ring contents. For Intel, I'm not planning on using this anyway, so I suspect Radeon will be the test case here.

We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support.

From what I can tell, all Intel chips support MI_SET_CONTEXT, with the possible exception of i830. I'm getting some additional docs for that chip to see what it does, but the i845 docs mention the 'new context support'; dunno if that's new as of i830 or i845...

The drivers already have the code to emit state in case of context loss; that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work?

With MI_SET_CONTEXT, you should never lose context, so we'd never need to be able to do this.

Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year).
Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported?

Hardware context support actually looks really easy to add; it's just a page in GTT per drm client, then an extra instruction in the ring when context switching is required. I'll have to see if the i830 supports it though. The current mechanism actually looks a lot harder to handle; I don't know why the driver didn't use MI_SET_CONTEXT from the very start. -- [EMAIL PROTECTED]
Re: DRI2 and lock-less operation
On Wed, 2007-11-28 at 00:52 +0100, Stephane Marchesin wrote: The third case obviously will never require any kind of state re-emitting. Unless you run out of hardware contexts. -- [EMAIL PROTECTED]
Re: DRI2 and lock-less operation
On Tue, Nov 27, 2007 at 03:44:48PM -0500, Kristian Høgsberg wrote: On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote: ... In both cases, the kernel will need to know how to activate a given context, and the context handle should be part of the super-ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful).

Oh, right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd? I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware...

I recall having the same discussion in the past; having the superioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware. All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'.

Exactly. But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort.

We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support.
The drivers already have the code to emit state in case of context loss; that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work?

Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that.

Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported? Kristian

Maybe just accept that only drivers converted to dri2 will be lockless, ie if you are dri2 you have the superioctl and other things like that; anyway, I believe what GLX 1.4 gives is a bit useless without a proper driver (TTM at least). Cheers, Jerome Glisse
Re: DRI2 and lock-less operation
On Wed, Nov 28, 2007 at 12:43:41AM +0100, Stephane Marchesin wrote: On 11/27/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: On Nov 27, 2007 11:48 AM, Stephane Marchesin [EMAIL PROTECTED] wrote: On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: ... It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl, and the X server can now push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is: is that feasible?

How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM?

The idea was that the buffer swap happens in the kernel, triggered by an ioctl. The kernel generates the command stream to execute the swap against the current set of cliprects. The back buffers are always private, so the cliprects only come into play when copying from the back buffer to the front buffer. Single buffered visuals are secretly double buffered and implemented the same way.

What if cliprects change between the time you emit them to the fifo and the time the blit gets executed by the card? Do you sync to the card in-drm, effectively killing performance?

I'm now trying to figure out whether it makes more sense to keep cliprects and swapbuffer out of the kernel, which wouldn't change the above much, except for the swapbuffer case. I described the idea for swapbuffer in this case in my reply to Thomas: the X server publishes cliprects to the clients through a shared ring buffer, and clients parse the cliprect changes out of this buffer as they need them. When posting a swap buffer request, the buffer head should be included in the super-ioctl so that the kernel can reject stale requests.
When that happens, the client must parse the new cliprect info and resubmit an updated swap buffer request.

I fail to see how this works with a lockless design. How do you ensure the X server doesn't change cliprects between the time it has written those in the shared ring buffer and the time the DRI client picks them up and has the command fired and actually executed? Do you lock out the server during that time? Stephane

All this is starting to confuse me a bit :) So, if I am right, all clients render to private buffers and then you want to blit those into the front buffer. In the composite world it's up to the compositor to do this, and so you don't have to care about cliprects. I think we can have some kind of dumb default compositor that would handle this blit directly in the X server. And this compositor might even not use cliprects. For window|pixmap|... resizing, can the following scheme fit:

Single buffered path (ie no back buffer, but the client still renders to a private buffer):
1) X gets the resize event
2) X asks drm to allocate a new drawable with the new size
3) X enqueues a query to drm which copies the current buffer content into the new one and also updates the drawable so further rendering requests will happen in the new buffer
4) X starts using the new buffer when compositing into the scanout buffer
So you might see rendering artifacts, but I guess this has to be expected in single buffered applications where a size change can happen at any time.

Double buffered path:
1) X gets the resize event
2) X allocates a new front buffer and blits the old front buffer into it (I might be wrong and X might not need to preserve content on resize) so X can start blitting using the new buffer size but with the old content.
X allocates a new back buffer
3) X enqueues a query to drm to change the drawable's back buffer
If there is no pending rendering on the current back buffer (old size):
4) drm updates the drawable so the back buffer is the new one (with the new size)
Finished with resizing
If there is pending rendering on the current back buffer (old size):
4) On the next swapbuffer, X blits the back buffer into the front buffer (if there is a need to preserve content), deallocates the back buffer and allocates a new one with the new size (so the front buffer stays the same, just a part of it is updated)
Finished with resizing

So this follows the comment Keith made on the wiki about resizing; I think it's the best strategy, even though if redrawing the window is slow one might never see proper content while constantly resizing the window (but we can't do much against a broken user ;)) Of course in this scheme you do the blitting from the private client buffer (whether it's the front buffer or the only buffer for single buffered apps) in client space, ie in the X server (so no need for cliprects in the kernel). Note that you blit as soon as step 3 finishes for single buffering and as soon as step 2 finishes for double buffering. Does
Re: DRI2 and lock-less operation
On Nov 22, 2007 4:23 AM, Keith Whitwell [EMAIL PROTECTED] wrote: ... My guess for one way is to store a buffer object with the current state emission in it, and submit it with the superioctl maybe, and if we have lost context emit it before the batchbuffer..

The way drivers actually work at the moment is to emit a full state as a preamble to each batchbuffer. Depending on the hardware, this can be pretty low overhead, and it seems that the trend in hardware is to make this operation cheaper and cheaper. This works fine without the lock. There is another complementary trend to support, one way or another, multiple hardware contexts (obviously nvidia have done this for years), meaning that the hardware effectively does the context switches. This is probably how most cards will end up working in the future, if not already. Neither of these needs a lock for detecting context switches.

Sure enough, but the problem is that without the lock userspace can't say "oops, I lost the context, let me prepend this state emission preamble to the batchbuffer" in a race-free way. If we want conditional state emission, we need to make that decision in the kernel. For example, the super-ioctl could send the state emission code as a separate buffer and also include the expected context handle. This lets the kernel compare the context handle supplied in the super-ioctl with the most recently active context handle, and if they differ, the kernel queues the state emission buffer first and then the rendering buffer. If the context handles match, the kernel just queues the rendering batch buffer. However, this means that user space must prepare the state emission code for each submission, whether or not it will actually be used. I'm not sure if this is too much overhead or if it's negligible? Kristian
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote: On Nov 22, 2007 4:23 AM, Keith Whitwell [EMAIL PROTECTED] wrote: ... My guess for one way is to store a buffer object with the current state emission in it, and submit it with the superioctl maybe, and if we have lost context emit it before the batchbuffer..

The way drivers actually work at the moment is to emit a full state as a preamble to each batchbuffer. Depending on the hardware, this can be pretty low overhead, and it seems that the trend in hardware is to make this operation cheaper and cheaper. This works fine without the lock. There is another complementary trend to support, one way or another, multiple hardware contexts (obviously nvidia have done this for years), meaning that the hardware effectively does the context switches. This is probably how most cards will end up working in the future, if not already. Neither of these needs a lock for detecting context switches.

Sure enough, but the problem is that without the lock userspace can't say "oops, I lost the context, let me prepend this state emission preamble to the batchbuffer" in a race-free way. If we want conditional state emission, we need to make that decision in the kernel.

The cases I describe above don't try to do this, but if you really wanted to, the way to do it would be to have userspace always emit the preamble but pass two offsets to the kernel, one at the start of the preamble, the other after it. Then the kernel can choose. I don't think there's a great deal to be gained from this optimization, though.

For example, the super-ioctl could send the state emission code as a separate buffer and also include the expected context handle. This lets the kernel compare the context handle supplied in the super-ioctl with the most recently active context handle, and if they differ, the kernel queues the state emission buffer first and then the rendering buffer. If the context handles match, the kernel just queues the rendering batch buffer.
However, this means that user space must prepare the state emission code for each submission, whether or not it will actually be used. I'm not sure if this is too much overhead or if it's negligible?

I think both preparing it on CPU and executing it on GPU are likely to be pretty negligible, but some experimentation on a system with just a single app running should show this quickly one way or another. Keith
Re: DRI2 and lock-less operation
On Nov 22, 2007 5:37 AM, Jerome Glisse [EMAIL PROTECTED] wrote: ... I will go this way too for r300/r400/r500; there are not so many register changes with different contexts, and registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, and pitch/tile information on these 2 buffers; this information will be embedded in drm_drawable, as will the cliprects if I am right :)). It will be up to the client to reemit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep hw in a given state.

Are you suggesting we emit the state for every batchbuffer submission? As I wrote in my reply to Keith, without a lock you can't check that the state is what you expect and then submit a batchbuffer from user space. The check has to be done in the kernel, and the kernel will then emit the state conditionally. And even if this scheme can eliminate unnecessary state emission, the state still has to be passed to the kernel with every batch buffer submission, in case the kernel needs to emit it.

So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

So you do need a lock? Could the ring buffer management be done in the drm by the super-ioctl code, and would that eliminate the need for a sarea? Kristian
Re: DRI2 and lock-less operation
On Nov 22, 2007 4:03 AM, Thomas Hellström [EMAIL PROTECTED] wrote: ... There are probably various ways to do this, which is another argument for keeping super-ioctls device-specific. For i915-type hardware, Dave's approach above is probably the most attractive one. For Poulsbo, all state is always implicitly included, usually as a reference to a buffer object, so we don't really bother about contexts here. For hardware like the Unichrome, the state is stored in a limited set of registers. Here the drm can keep a copy of those registers for each context and do a smart update on a context switch. However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible.

The idea is to keep the cliprects in the kernel and only render double buffered. The only code that needs to worry about cliprects is swap buffer, either immediate or synced to vblank. What are the cliprect problems you mention? Kristian
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote: On Nov 22, 2007 5:37 AM, Jerome Glisse [EMAIL PROTECTED] wrote: ... I will go this way too for r300/r400/r500; there are not so many register changes with different contexts, and registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, and pitch/tile information on these 2 buffers; this information will be embedded in drm_drawable, as will the cliprects if I am right :)). It will be up to the client to reemit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep hw in a given state.

Are you suggesting we emit the state for every batchbuffer submission? As I wrote in my reply to Keith, without a lock you can't check that the state is what you expect and then submit a batchbuffer from user space. The check has to be done in the kernel, and the kernel will then emit the state conditionally. And even if this scheme can eliminate unnecessary state emission, the state still has to be passed to the kernel with every batch buffer submission, in case the kernel needs to emit it.

I didn't explain myself properly; I meant that it's up to the client to reemit all state in each call to the superioctl, so there will be full state emission, but I won't check it, ie if it doesn't emit full state then userspace can just expect buggy rendering. Meanwhile there are a few things I don't want userspace to mess with, as they likely need special treatment, like the offset in ram where 3d rendering is going to happen, where the ancillary buffers are, ... I expect to have all this information attached to a drm_drawable (rendering buffer, ancillary buffer, ...), so each call of the superioctl will have this:
- drm_drawable (where rendering will be)
- full state
- additional buffers needed for rendering (vertex buffer, textures, ...)
So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

So you do need a lock? Could the ring buffer management be done in the drm by the super-ioctl code, and would that eliminate the need for a sarea?

This ring lock is for internal use only. If one client is in the superioctl and another is emitting a fence, both need to write to the ring but through different paths; to avoid any userspace lock I have a kernel lock which any function writing to the ring needs to take. As writes to the ring should be short, this shouldn't hurt. So no, I don't need a lock and I don't want a lock :) Hope I expressed myself better :) Cheers, Jerome Glisse
Re: DRI2 and lock-less operation
On Nov 26, 2007 3:40 PM, Jerome Glisse [EMAIL PROTECTED] wrote: Kristian Høgsberg wrote: On Nov 22, 2007 5:37 AM, Jerome Glisse [EMAIL PROTECTED] wrote: ... I will go this way too for r300/r400/r500; there are not so many register changes with different contexts, and registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, and pitch/tile information on these 2 buffers; this information will be embedded in drm_drawable, as will the cliprects if I am right :)). It will be up to the client to reemit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep hw in a given state.

Are you suggesting we emit the state for every batchbuffer submission? As I wrote in my reply to Keith, without a lock you can't check that the state is what you expect and then submit a batchbuffer from user space. The check has to be done in the kernel, and the kernel will then emit the state conditionally. And even if this scheme can eliminate unnecessary state emission, the state still has to be passed to the kernel with every batch buffer submission, in case the kernel needs to emit it.

I didn't explain myself properly; I meant that it's up to the client to reemit all state in each call to the superioctl, so there will be full state emission, but I won't check it, ie if it doesn't emit full state then userspace can just expect buggy rendering. Meanwhile there are a few things I don't want userspace to mess with, as they likely need special treatment, like the offset in ram where 3d rendering is going to happen, where the ancillary buffers are, ... I expect to have all this information attached to a drm_drawable (rendering buffer, ancillary buffer, ...), so each call of the superioctl will have this:
- drm_drawable (where rendering will be)
- full state
- additional buffers needed for rendering (vertex buffer, textures, ...)

Great, thanks for clarifying. Sounds good.
So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

So you do need a lock? Could the ring buffer management be done in the drm by the super-ioctl code, and would that eliminate the need for a sarea?

This ring lock is for internal use only. If one client is in the superioctl and another is emitting a fence, both need to write to the ring but through different paths; to avoid any userspace lock I have a kernel lock which any function writing to the ring needs to take. As writes to the ring should be short, this shouldn't hurt. So no, I don't need a lock and I don't want a lock :) Hope I expressed myself better :)

Hehe, I guess I've been a bit unclear too: what I want to make sure we can get rid of is the *userspace* lock that's currently shared between the X server and the DRI clients. What I hear you saying above is that you still need a kernel side lock to synchronize access to the ring buffer, which of course is required. The ring buffer and the lock that protects it live in the kernel, and user space can't access them directly. When emitting batchbuffers or fences, the ioctl() will need to take the lock, but will always release it before returning back to userspace. Sounds like this will all work out after all :) Kristian
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote: On Nov 26, 2007 3:40 PM, Jerome Glisse [EMAIL PROTECTED] wrote: Kristian Høgsberg wrote: On Nov 22, 2007 5:37 AM, Jerome Glisse [EMAIL PROTECTED] wrote: ... I will go this way too for r300/r400/r500; there are not so many register changes with different contexts, and registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, and pitch/tile information on these 2 buffers; this information will be embedded in drm_drawable, as will the cliprects if I am right :)). It will be up to the client to reemit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep hw in a given state.

Are you suggesting we emit the state for every batchbuffer submission? As I wrote in my reply to Keith, without a lock you can't check that the state is what you expect and then submit a batchbuffer from user space. The check has to be done in the kernel, and the kernel will then emit the state conditionally. And even if this scheme can eliminate unnecessary state emission, the state still has to be passed to the kernel with every batch buffer submission, in case the kernel needs to emit it.

I didn't explain myself properly; I meant that it's up to the client to reemit all state in each call to the superioctl, so there will be full state emission, but I won't check it, ie if it doesn't emit full state then userspace can just expect buggy rendering. Meanwhile there are a few things I don't want userspace to mess with, as they likely need special treatment, like the offset in ram where 3d rendering is going to happen, where the ancillary buffers are, ... I expect to have all this information attached to a drm_drawable (rendering buffer, ancillary buffer, ...), so each call of the superioctl will have this:
- drm_drawable (where rendering will be)
- full state
- additional buffers needed for rendering (vertex buffer, textures, ...)
Great, thanks for clarifying. Sounds good.

So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

So you do need a lock? Could the ring buffer management be done in the drm by the super-ioctl code, and would that eliminate the need for a sarea?

This ring lock is for internal use only. If one client is in the superioctl and another is emitting a fence, both need to write to the ring but through different paths; to avoid any userspace lock I have a kernel lock which any function writing to the ring needs to take. As writes to the ring should be short, this shouldn't hurt. So no, I don't need a lock and I don't want a lock :) Hope I expressed myself better :)

Hehe, I guess I've been a bit unclear too: what I want to make sure we can get rid of is the *userspace* lock that's currently shared between the X server and the DRI clients. What I hear you saying above is that you still need a kernel side lock to synchronize access to the ring buffer, which of course is required. The ring buffer and the lock that protects it live in the kernel, and user space can't access them directly. When emitting batchbuffers or fences, the ioctl() will need to take the lock, but will always release it before returning back to userspace. Sounds like this will all work out after all :) Kristian

Yes. By the way, I started looking for your dri2 branch but didn't find a dri2 branch in your intel ddx repository. Did you push the code anywhere else? It would help me to see what I need to do for dri2 in the ddx. Cheers, Jerome Glisse
Re: DRI2 and lock-less operation
On Mon, 2007-11-26 at 17:15 -0500, Kristian Høgsberg wrote: -full state I assume you'll deal with hardware which supports multiple contexts and avoid the need to include full state with each buffer? -- [EMAIL PROTECTED]
Re: DRI2 and lock-less operation
I'm trying to figure out how context switches actually work... the DRI lock is overloaded as a context switcher, and there is code in the kernel to call out to a chipset-specific context switch routine when the DRI lock is taken... but only ffb uses it... So I'm guessing the way context switches work today is that the DRI driver grabs the lock and, after potentially updating the cliprects through X protocol, emits all the state it depends on to the card. Is the state emission done by just writing out a bunch of registers? Is this how the X server works too, except that XAA/EXA acceleration doesn't depend on a lot of state, and thus the DDX driver can emit everything for each operation?

So yes, userspace notices the context has changed and just re-emits everything into the batchbuffer it is going to send. For the XAA/EXA stuff in Intel, at least, there is an invariant-state emission function that notices what the context was and what the last server 3D user was (EXA or Xv texturing) and just dumps the state into the batchbuffer (or, currently, into the ring).

How would this work if we didn't have a lock? You can't emit the state and then start rendering without a lock to keep the state in place... If the kernel doesn't restore any state, what's the point of the drm_context_t we pass to the kernel in drmLock? Should the kernel know how to restore state (this ties in to the email from jglisse on state tracking in drm and all the gallium jazz, I guess)? How do we identify state to the kernel, and how do we pass it in in the super-ioctl? Can we add a list of registers to be written and the values? I talked to Dave about it and we agreed that adding a drm_context_t to the super-ioctl would work, but now I'm thinking that if the kernel doesn't track any state it won't really work. Maybe cross-client state sharing isn't useful for performance, as Keith and Roland argue, but if the kernel doesn't restore state when it sees a super-ioctl coming from a different context, who does?
My guess at one way is to store a buffer object with the current state emission in it and submit it with the superioctl; then, if we have lost the context, emit it before the batchbuffer.

Dave.
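Dave's resubmit idea can be sketched as follows. All of the names here are hypothetical stand-ins, not real drm API: `submit_batch()` models a chipset-specific superioctl, and `ELOSTCTX` models a "context was lost" error code the kernel would return so the client knows to resubmit with its state buffer prepended.

```c
#include <assert.h>

#define ELOSTCTX 1000   /* hypothetical "context was lost" error */

struct fake_dev { int current_ctx; };

/* Kernel side: reject the batch if another context ran since our last
 * submission, unless the batch carries its own state preamble. */
static int submit_batch(struct fake_dev *dev, int ctx, int has_preamble)
{
    if (dev->current_ctx != ctx && !has_preamble)
        return -ELOSTCTX;
    dev->current_ctx = ctx;   /* hardware now holds our state */
    return 0;
}

/* Client side: try the cheap path first, and prepend the stored
 * state-emission buffer only when the kernel reports a lost context. */
int submit_with_retry(struct fake_dev *dev, int ctx, int *resubmitted)
{
    int ret = submit_batch(dev, ctx, /*has_preamble=*/0);
    if (ret == -ELOSTCTX) {
        *resubmitted = 1;
        ret = submit_batch(dev, ctx, /*has_preamble=*/1);
    }
    return ret;
}
```

The attraction of this scheme is that the rendering batchbuffer itself never changes on resubmit; only the state buffer is glued on in front.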
Re: DRI2 and lock-less operation
Dave Airlie wrote: [Kristian's questions on context switches and Dave's reply quoted in full; snipped]

There are probably various ways to do this, which is another argument for keeping super-ioctls device-specific.

For i915-type hardware, Dave's approach above is probably the most attractive one. For Poulsbo, all state is always implicitly included, usually as a reference to a buffer object, so we don't really bother about contexts there. For hardware like the Unichrome, the state is stored in a limited set of registers; there the drm can keep a copy of those registers for each context and do a smart update on a context switch.

However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible.

/Thomas
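The Unichrome-style scheme Thomas describes, where the drm keeps a per-context shadow copy of the registers and writes only the ones that differ on a context switch, might look roughly like this. Every name here is made up for illustration; no real Unichrome register layout is implied.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 8

/* Modeled hardware state: the registers the chip currently holds,
 * plus a counter so we can see how many writes a switch costs. */
struct hw { uint32_t regs[NREGS]; int writes; };

/* Per-context shadow copy kept by the drm. */
struct ctx_state { uint32_t regs[NREGS]; };

/* On a context switch, compare the incoming context's shadow copy
 * with what the hardware holds and write only the registers that
 * actually differ. */
void restore_context(struct hw *hw, const struct ctx_state *ctx)
{
    for (int i = 0; i < NREGS; i++) {
        if (hw->regs[i] != ctx->regs[i]) {
            hw->regs[i] = ctx->regs[i];
            hw->writes++;
        }
    }
}
```

With a small, fixed register file this makes a context switch nearly free when two contexts share most of their state.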
Re: DRI2 and lock-less operation
Dave Airlie wrote: [Kristian's questions on context switches and Dave's reply quoted in full; snipped]

The way drivers actually work at the moment is to emit full state as a preamble to each batchbuffer. Depending on the hardware, this can be pretty low overhead, and the trend in hardware seems to be to make this operation cheaper and cheaper. This works fine without the lock.

There is another, complementary trend to support multiple hardware contexts in one way or another (obviously nvidia have done this for years), meaning that the hardware effectively does the context switches itself. This is probably how most cards will end up working in the future, if not already.

Neither of these needs a lock for detecting context switches.

Keith
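Keith's full-state-preamble approach amounts to the following: every submission unconditionally carries the driver's complete state emission in front of the rendering commands, so the batch is valid no matter which context ran last and no lock is needed. `build_submission()` is an illustrative helper, not a real driver function.

```c
#include <stdint.h>
#include <string.h>

/* Concatenate the full-state preamble and the rendering payload into
 * one self-contained submission; returns the total dword count. */
unsigned build_submission(uint32_t *out,
                          const uint32_t *preamble, unsigned npre,
                          const uint32_t *payload, unsigned npay)
{
    memcpy(out, preamble, npre * sizeof *out);          /* state first */
    memcpy(out + npre, payload, npay * sizeof *out);    /* then rendering */
    return npre + npay;
}
```

The cost is `npre` extra dwords per batch, which is exactly the overhead Keith notes is getting cheaper on newer hardware.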
Re: DRI2 and lock-less operation
Keith Whitwell wrote: [Keith's points about full-state preambles and multiple hardware contexts quoted in full; snipped]

I will go this way too. For r300/r400/r500 there are not that many register changes between different contexts, and the registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, plus pitch/tile information for these two buffers; this information will be embedded in the drm_drawable, as the cliprects are, if I am right :)). It will be up to the client to re-emit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep the hw in a given state. So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

Cheers,
Jerome Glisse
DRI2 and lock-less operation
Hi all,

I've been working on the DRI2 implementation recently and I'm starting to get a little confused, so I figured I'd throw a couple of questions out to the list.

First off, I wrote up this summary shortly after XD: http://wiki.x.org/wiki/DRI2 which, upon re-reading, is still pretty much up to date with what I'm trying to do. The buzzword summary from the page:

* Lockless
* Always private back buffers
* No clip rects in DRI driver
* No DDX driver part
* Minimal X server part
* Swap buffer and clip rects in kernel
* No SAREA

I've implemented the DRI2 xserver module (http://cgit.freedesktop.org/~krh/xserver/log/?h=dri2) and the new drm ioctls that it uses (http://cgit.freedesktop.org/~krh/drm/log/?h=dri2). I did the DDX part for the intel driver; DRI2 initialization consists of doing drmOpen (this is now up to the DDX driver), initializing the memory manager, using it to allocate stuff, and then calling DRI2ScreenInit(), passing in pScreen and the file descriptor. Basically, none of i830_dri.c is used in this mode.

It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note that the drawable lock is gone; we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl, and the X server can now push cliprects into the kernel in one atomic operation, we should be able to get rid of the DRI lock. My overall question here is: is that feasible?

I'm trying to figure out how context switches actually work... the DRI lock is overloaded as a context switcher, and there is code in the kernel to call out to a chipset-specific context switch routine when the DRI lock is taken... but only ffb uses it... So I'm guessing the way context switches work today is that the DRI driver grabs the lock and, after potentially updating the cliprects through X protocol, emits all the state it depends on to the card.

Is the state emission done by just writing out a bunch of registers? Is this how the X server works too, except that XAA/EXA acceleration doesn't depend on a lot of state, and thus the DDX driver can emit everything for each operation?

How would this work if we didn't have a lock? You can't emit the state and then start rendering without a lock to keep the state in place... If the kernel doesn't restore any state, what's the point of the drm_context_t we pass to the kernel in drmLock? Should the kernel know how to restore state (this ties in to the email from jglisse on state tracking in drm and all the gallium jazz, I guess)? How do we identify state to the kernel, and how do we pass it in in the super-ioctl? Can we add a list of registers to be written and the values? I talked to Dave about it and we agreed that adding a drm_context_t to the super-ioctl would work, but now I'm thinking that if the kernel doesn't track any state it won't really work. Maybe cross-client state sharing isn't useful for performance, as Keith and Roland argue, but if the kernel doesn't restore state when it sees a super-ioctl coming from a different context, who does?

Sorry for the question blitz, and I'm sure some of these sound a bit naive, as I'm not too familiar with the lower levels of many drivers. But if we're planning to move away from the DRI lock in the near future, we need to figure out what to do here. Or maybe we're not getting rid of the DRI lock anytime soon, but that would be a shame given that we've got everything else lined up.

cheers,
Kristian
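For what it's worth, a lockless super-ioctl argument block of the kind discussed in this thread might carry something like the following. Every field and flag name here is a hypothetical sketch, not taken from the real drm headers; the real layout would be chipset-specific, as noted above.

```c
#include <assert.h>
#include <stdint.h>

/* Batch begins with a full state-emission preamble, so the kernel may
 * accept it even after a context switch (hypothetical flag). */
#define SUBMIT_REEMITS_STATE  (1u << 0)

struct super_submit {
    uint32_t ctx_handle;     /* which (kernel-managed) context this is */
    uint32_t flags;          /* SUBMIT_REEMITS_STATE, ... */
    uint64_t batch_handle;   /* buffer object containing the commands */
    uint32_t batch_len;      /* bytes of valid commands in the batch */
    uint32_t num_cliprects;  /* cliprect snapshot, copied in atomically */
    uint64_t cliprects_ptr;  /* userspace pointer to the rect array */
};
```

Pointers and handles are kept 64-bit so the same layout works for 32-bit userspace on a 64-bit kernel; the cliprects travel with the submission, which is what lets the X server update them atomically without a shared lock.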