Re: DRI2 and lock-less operation
Kristian Høgsberg wrote:

On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote:

... In both cases, the kernel will need to know how to activate a given context, and the context handle should be part of the super-ioctl arguments.

We needn't expose the contexts to user space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server, where a context per client may be useful).

Oh right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd?

I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, then expect user space to resubmit with a state-emission preamble. In fact it may work well for single-context hardware...

I recall having the same discussion in the past; having the super-ioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware. All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'.

Exactly. But the super-ioctl is chipset-specific and we can decide on the details there on a chipset-by-chipset basis. If you have input on how the super-ioctl for Intel should look to support lockless operation for current and future Intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort.

We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted.

Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support.
The drivers already have the code to emit state in case of context loss; that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work?

Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that.

Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported?

There are three ways to support lockless operation:

- hardware contexts
- a full preamble emit per batchbuffer
- passing a pair of preamble, payload batchbuffers per ioctl

I think all hardware is capable of supporting at least one of these. That said, if the super-ioctl is per-device, then you can make a choice per-device in terms of whether the lock is required or not, which makes things easier. The reality is that most ttm-based drivers will do very little differently on a regular lock compared to a contended one -- at most they could decide whether or not to emit a preamble they computed earlier.

Keith

-
SF.Net email is sponsored by: The Future of Linux Business White Paper from Novell. From the desktop to the data center, Linux is going mainstream. Let it simplify your IT future. http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel
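The flag-plus-error-return scheme discussed above can be simulated in user space. This is a minimal sketch under invented names -- `ELOSTCTX`, `SUBMIT_REINIT`, `drm_submit`, and `mock_hw` are all hypothetical, and the real super-ioctl would be chipset-specific as noted in the thread. The client submits optimistically without a preamble, and only on a 'lost context' error prepends the full state-emission preamble and resubmits the same rendering:

```c
#include <assert.h>

/* Hypothetical names for the scheme discussed above. */
#define ELOSTCTX      1    /* "lost context" error return */
#define SUBMIT_REINIT 0x1  /* "this submission reinitializes the hardware" */

struct mock_hw { int ctx_lost; int batches_run; int last_flags; };

/* Stand-in for the super-ioctl: fails if the context was lost and the
 * submission carries no state-restore preamble. */
static int drm_submit(struct mock_hw *hw, const char *batch, int flags)
{
    if (hw->ctx_lost && !(flags & SUBMIT_REINIT))
        return -ELOSTCTX;
    hw->ctx_lost = 0;
    hw->batches_run++;
    hw->last_flags = flags;
    return 0;
}

/* Client loop: optimistically submit without a preamble; on -ELOSTCTX,
 * resubmit the unchanged rendering flagged with the state preamble. */
static int submit_with_resubmit(struct mock_hw *hw, const char *batch)
{
    int ret = drm_submit(hw, batch, 0);
    if (ret == -ELOSTCTX)
        ret = drm_submit(hw, batch, SUBMIT_REINIT);
    return ret;
}
```

The point of the sketch is the asymmetry Keith describes: the full state restore is only computed on the rare lost-context path, not on every submission.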
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote:

Another problem is that it's not just swapbuffer - anything that touches the front buffer has to respect the cliprects - glCopyPixels and glXCopySubBufferMESA - and thus has to happen in the kernel.

These don't touch the real swapbuffer, just the fake one. Hence they don't care about cliprects and certainly don't have to happen in the kernel...

Keith
Re: DRI2 and lock-less operation
On Wed, 2007-11-28 at 09:30 +, Keith Whitwell wrote:

Kristian Høgsberg wrote:

Another problem is that it's not just swapbuffer - anything that touches the front buffer has to respect the cliprects - glCopyPixels and glXCopySubBufferMESA - and thus has to happen in the kernel.

These don't touch the real swapbuffer, just the fake one. Hence they don't care about cliprects and certainly don't have to happen in the kernel...

I'm not sure about glCopyPixels, but glXCopySubBufferMESA would most definitely be useless if it didn't copy to the real frontbuffer.

--
Earthling Michel Dänzer | http://tungstengraphics.com
Libre software enthusiast | Debian, X and DRI developer
Re: DRI2 and lock-less operation
Michel Dänzer wrote:

On Wed, 2007-11-28 at 09:30 +, Keith Whitwell wrote:

Kristian Høgsberg wrote:

Another problem is that it's not just swapbuffer - anything that touches the front buffer has to respect the cliprects - glCopyPixels and glXCopySubBufferMESA - and thus has to happen in the kernel.

These don't touch the real swapbuffer, just the fake one. Hence they don't care about cliprects and certainly don't have to happen in the kernel...

I'm not sure about glCopyPixels, but glXCopySubBufferMESA would most definitely be useless if it didn't copy to the real frontbuffer.

Yes, wasn't paying attention... glXCopySubBufferMESA would do both - copy to the fake front buffer and then trigger a damage-induced update of the real frontbuffer. Neither operation requires the 3d driver to know about cliprects, and the damage operation is basically a generalization of the swapbuffer stuff we've been talking about.

Keith
Swapbuffers [was: Re: DRI2 and lock-less operation]
Kristian Høgsberg wrote:

On Nov 27, 2007 11:48 AM, Stephane Marchesin [EMAIL PROTECTED] wrote:

On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote:

... It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl, and the X server now can push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is: is that feasible?

How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM?

The idea was that the buffer swap happens in the kernel, triggered by an ioctl. The kernel generates the command stream to execute the swap against the current set of cliprects. The back buffers are always private, so the cliprects only come into play when copying from the back buffer to the front buffer. Single-buffered visuals are secretly double-buffered and implemented the same way.

I'm trying to figure out now whether it makes more sense to keep cliprects and swapbuffer out of the kernel, which wouldn't change the above much, except the swapbuffer case. I described the idea for swapbuffer in this case in my reply to Thomas: the X server publishes cliprects to the clients through a shared ring buffer, and clients parse the cliprect changes out of this buffer as they need them. When posting a swap buffer request, the buffer head should be included in the super-ioctl so that the kernel can reject stale requests. When that happens, the client must parse the new cliprect info and resubmit an updated swap buffer request.

In my ideal world, the entity which knows and cares about cliprects should be the one that does the swapbuffers, or at least is in control of the process. That entity is the X server.
Instead of tying ourselves into knots trying to figure out how to get some other entity a sufficiently up-to-date set of cliprects to make this work (which is what was wrong with DRI 1.0), maybe we should try and figure out how to get the X server to efficiently orchestrate swapbuffers. In particular it seems like we have:

1) The X server knows about cliprects.
2) The kernel knows about IRQ reception.
3) The kernel knows how to submit rendering commands to hardware.
4) Userspace is where we want to craft rendering commands.

Given the above, what do we think about swapbuffers:

a) Swapbuffers is a rendering command
b) which depends on cliprect information
c) that needs to be fired as soon as possible after an IRQ receipt.

So:

swapbuffers should be crafted from userspace (a, 4)
... by the X server (b, 1)
... and should be actually fired by the kernel (c, 2, 3)

I propose something like:

0) 3D client submits rendering to the kernel and receives back a fence.
1) 3D client wants to do swapbuffers. It sends a message to the X server asking it: please do me a swapbuffers after this fence has completed.
2) X server crafts (somehow) commands implementing swapbuffers for this drawable under the current set of cliprects and passes them to the kernel along with the fence.
3) The kernel keeps that batchbuffer to the side until
   a) the commands associated with the fence have been submitted to hardware.
   b) the next vblank IRQ arrives.

When both of these are true, the kernel simply submits the prepared swapbuffer commands through the lowest-latency path to hardware.

But what happens if the cliprects change? The 100% perfect solution looks like: The X server knows all about cliprect changes, and can use fences or other mechanisms to keep track of which swapbuffers are outstanding. At the time of a cliprect change, it must create new swapbuffer command sets for all pending swapbuffers and re-submit those to the kernel.
These new sets of commands must be tied to the progress of the X server's own rendering command stream, so that the kernel fires the appropriate one to land the swapbuffers to the correct destination as the X server's own rendering flies by. In the simplest case, where the kernel puts commands onto the one true ring as it receives them, the kernel can simply discard the old swapbuffer command. Indeed this is true also if the kernel has a ring-per-context and uses one of those rings to serialize the X server rendering and swapbuffers commands.

Note that condition 3a) above is always true in the current i915.o one-true-ring/single-fifo approach to hardware serialization.

I think the above can work and seems more straightforward than many of the proposed alternatives.

Keith
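The kernel-side "keep it to the side" logic from step 3 can be sketched as a small state machine. This is a speculative illustration, not actual DRM code: `pending_swap`, `swap_state`, `try_fire_swap`, and the cliprect generation counter are all invented names, with the generation check standing in for however the X server's resubmission after a cliprect change would supersede a stale command set:

```c
#include <assert.h>

/* A prepared swapbuffer command set, held aside by the kernel. */
struct pending_swap {
    int fence_seq;     /* rendering this swap depends on */
    int cliprect_gen;  /* cliprect generation it was built against */
    int fired;
    int discarded;
};

struct swap_state {
    int completed_seq;  /* highest fence sequence retired by hardware */
    int vblank_pending; /* a vblank IRQ has arrived */
    int cliprect_gen;   /* current cliprect generation */
};

/* Called on fence retirement and on vblank IRQ. Fires the prepared swap
 * only when its fence has completed AND a vblank has arrived; a command
 * set built against stale cliprects is simply discarded, because a
 * resubmitted version exists. Returns 1 if the swap was submitted. */
static int try_fire_swap(struct swap_state *st, struct pending_swap *sw)
{
    if (sw->fired || sw->discarded)
        return 0;
    if (sw->cliprect_gen != st->cliprect_gen) {
        sw->discarded = 1;  /* superseded by a resubmitted command set */
        return 0;
    }
    if (st->completed_seq >= sw->fence_seq && st->vblank_pending) {
        st->vblank_pending = 0;
        sw->fired = 1;      /* lowest-latency path to hardware */
        return 1;
    }
    return 0;
}
```

Note how discarding only drops the stale *version* of the command set, not the swap itself -- every swap is still performed, matching the clarification later in the thread.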
Re: DRI2 and lock-less operation
On 11/28/07, Keith Packard [EMAIL PROTECTED] wrote:

On Wed, 2007-11-28 at 00:52 +0100, Stephane Marchesin wrote:

The third case obviously will never require any kind of state re-emitting.

Unless you run out of hardware contexts.

Well, in that case we (plan to) bang the state registers from the kernel directly and do a manual state swap. So we still don't need state re-emitting.

Stephane
Re: DRI2 and lock-less operation
On Tue, 2007-11-27 at 16:50 -0500, Kristian Høgsberg wrote:

[...] I'm starting to doubt the cliprects and swapbuffer in the kernel design anyhow. I wasn't going this route originally, but since we already had that in place for i915 vblank buffer swaps, I figured we might as well go that route. I'm not sure the buffer swap from the vblank tasklet is really necessary to begin with... are there benchmarks showing that the irq-to-userspace latency was too big for this to work in userspace?

Yes, I implemented this on an i945 system where waiting for vertical blank and then submitting the buffer swap blit from userspace resulted in constant tearing.

--
Earthling Michel Dänzer | http://tungstengraphics.com
Libre software enthusiast | Debian, X and DRI developer
Re: DRI2 and lock-less operation
On Tue, 2007-11-27 at 15:44 -0500, Kristian Høgsberg wrote:

On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote:

... In both cases, the kernel will need to know how to activate a given context, and the context handle should be part of the super-ioctl arguments.

We needn't expose the contexts to user space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server, where a context per client may be useful).

Oh right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd?

That's not supported, every GLX context has its own DRM context.

--
Earthling Michel Dänzer | http://tungstengraphics.com
Libre software enthusiast | Debian, X and DRI developer
Re: Swapbuffers [was: Re: DRI2 and lock-less operation]
On 11/28/07, Keith Whitwell [EMAIL PROTECTED] wrote:

In my ideal world, the entity which knows and cares about cliprects should be the one that does the swapbuffers, or at least is in control of the process. That entity is the X server. Instead of tying ourselves into knots trying to figure out how to get some other entity a sufficiently up-to-date set of cliprects to make this work (which is what was wrong with DRI 1.0), maybe we should try and figure out how to get the X server to efficiently orchestrate swapbuffers. In particular it seems like we have:

1) The X server knows about cliprects.
2) The kernel knows about IRQ reception.
3) The kernel knows how to submit rendering commands to hardware.
4) Userspace is where we want to craft rendering commands.

Given the above, what do we think about swapbuffers:

a) Swapbuffers is a rendering command
b) which depends on cliprect information
c) that needs to be fired as soon as possible after an IRQ receipt.

So:

swapbuffers should be crafted from userspace (a, 4)
... by the X server (b, 1)
... and should be actually fired by the kernel (c, 2, 3)

Well, on nvidia hw, you don't even need to fire from the kernel (thanks to a special fifo command that waits for vsync). So I'd love it if going through the kernel for swapbuffers was abstracted by the interface.

I propose something like:

0) 3D client submits rendering to the kernel and receives back a fence.
1) 3D client wants to do swapbuffers. It sends a message to the X server asking it: please do me a swapbuffers after this fence has completed.
2) X server crafts (somehow) commands implementing swapbuffers for this drawable under the current set of cliprects and passes them to the kernel along with the fence.
3) The kernel keeps that batchbuffer to the side until
   a) the commands associated with the fence have been submitted to hardware.
   b) the next vblank IRQ arrives.
When both of these are true, the kernel simply submits the prepared swapbuffer commands through the lowest-latency path to hardware. But what happens if the cliprects change? The 100% perfect solution looks like: The X server knows all about cliprect changes, and can use fences or other mechanisms to keep track of which swapbuffers are outstanding. At the time of a cliprect change, it must create new swapbuffer command sets for all pending swapbuffers and re-submit those to the kernel. These new sets of commands must be tied to the progress of the X server's own rendering command stream, so that the kernel fires the appropriate one to land the swapbuffers to the correct destination as the X server's own rendering flies by.

Yes, that was the basis for my thinking as well. By inserting the swapbuffers into the normal flow of X commands, we remove the need for syncing with the X server at swapbuffer time.

In the simplest case, where the kernel puts commands onto the one true ring as it receives them, the kernel can simply discard the old swapbuffer command. Indeed this is true also if the kernel has a ring-per-context and uses one of those rings to serialize the X server rendering and swapbuffers commands.

Come on, admit that's a hack to get 100'000 fps in glxgears :)

Note that condition 3a) above is always true in the current i915.o one-true-ring/single-fifo approach to hardware serialization.

Yes, I think those details of how to wait should be left driver-dependent and abstracted in user space, so that we have the choice of calling the kernel, doing it from user space only, relying on a single fifo or whatever.

I think the above can work and seems more straightforward than many of the proposed alternatives.

This is what I want to do too. Especially since in the nvidia case we don't have the issue of routing vblank interrupts to user space for that. So, the only issue I'm worried about is the latency induced by this approach.
When the DRM does the swaps, you can ensure it'll get executed pretty fast. If X has been stuffing piles of commands into its command buffer, it might not be so fast. What this means is that 3D might be slowed down by 2D rendering (think especially of the case of EXA fallbacks, which will sync your fifo). In that case, ensuring a no-fallback EXA would become relevant to achieving smooth 3D performance. But at least it solves the issue of sluggish OpenGL window moves and resizes (/me looks at the nvidia binary driver behaviour).

Stephane
Re: Swapbuffers [was: Re: DRI2 and lock-less operation]
Stephane Marchesin wrote:

On 11/28/07, Keith Whitwell [EMAIL PROTECTED] wrote:

In my ideal world, the entity which knows and cares about cliprects should be the one that does the swapbuffers, or at least is in control of the process. That entity is the X server. Instead of tying ourselves into knots trying to figure out how to get some other entity a sufficiently up-to-date set of cliprects to make this work (which is what was wrong with DRI 1.0), maybe we should try and figure out how to get the X server to efficiently orchestrate swapbuffers. In particular it seems like we have:

1) The X server knows about cliprects.
2) The kernel knows about IRQ reception.
3) The kernel knows how to submit rendering commands to hardware.
4) Userspace is where we want to craft rendering commands.

Given the above, what do we think about swapbuffers:

a) Swapbuffers is a rendering command
b) which depends on cliprect information
c) that needs to be fired as soon as possible after an IRQ receipt.

So:

swapbuffers should be crafted from userspace (a, 4)
... by the X server (b, 1)
... and should be actually fired by the kernel (c, 2, 3)

Well, on nvidia hw, you don't even need to fire from the kernel (thanks to a special fifo command that waits for vsync). So I'd love it if going through the kernel for swapbuffers was abstracted by the interface.

As I suggested elsewhere, I think that you're probably going to need this even on nvidia hardware.

I propose something like:

0) 3D client submits rendering to the kernel and receives back a fence.
1) 3D client wants to do swapbuffers. It sends a message to the X server asking it: please do me a swapbuffers after this fence has completed.
2) X server crafts (somehow) commands implementing swapbuffers for this drawable under the current set of cliprects and passes them to the kernel along with the fence.
3) The kernel keeps that batchbuffer to the side until
   a) the commands associated with the fence have been submitted to hardware.
   b) the next vblank IRQ arrives.

When both of these are true, the kernel simply submits the prepared swapbuffer commands through the lowest-latency path to hardware. But what happens if the cliprects change? The 100% perfect solution looks like: The X server knows all about cliprect changes, and can use fences or other mechanisms to keep track of which swapbuffers are outstanding. At the time of a cliprect change, it must create new swapbuffer command sets for all pending swapbuffers and re-submit those to the kernel. These new sets of commands must be tied to the progress of the X server's own rendering command stream, so that the kernel fires the appropriate one to land the swapbuffers to the correct destination as the X server's own rendering flies by.

Yes, that was the basis for my thinking as well. By inserting the swapbuffers into the normal flow of X commands, we remove the need for syncing with the X server at swapbuffer time.

The very simplest way would be just to have the X server queue it up like normal blits and not even involve the kernel. I'm not proposing this. I believe such an approach will fail for the sync-to-vblank case due to latency issues - even (I suspect) with hardware wait-for-vblank. Rather, I'm describing a mechanism that allows a pre-prepared swapbuffer command to be injected into the X command stream (one way or another) with the guarantee that it will encode the correct cliprects, but which will avoid stalling the command queue in the meantime.

In the simplest case, where the kernel puts commands onto the one true ring as it receives them, the kernel can simply discard the old swapbuffer command. Indeed this is true also if the kernel has a ring-per-context and uses one of those rings to serialize the X server rendering and swapbuffers commands.
Come on, admit that's a hack to get 100'000 fps in glxgears :)

I'm not talking about discarding the whole swap operation, just the version of the swap command buffer that pertained to the old cliprects. Every swap is still being performed. You do raise a good point though -- we currently throttle the 3d driver based on swapbuffer fences. There would need to be some equivalent mechanism to achieve this.

Note that condition 3a) above is always true in the current i915.o one-true-ring/single-fifo approach to hardware serialization.

Yes, I think those details of how to wait should be left driver-dependent and abstracted in user space. So that we have the choice of calling the kernel, doing it from user space only, relying on a single fifo or
Re: Swapbuffers [was: Re: DRI2 and lock-less operation]
On Wed, Nov 28, 2007 at 12:43:18PM +0100, Stephane Marchesin wrote:

This is what I want to do too. Especially since in the nvidia case we don't have the issue of routing vblank interrupts to user space for that. So, the only issue I'm worried about is the latency induced by this approach. When the DRM does the swaps, you can ensure it'll get executed pretty fast. If X has been stuffing piles of commands into its command buffer, it might not be so fast. What this means is that 3D might be slowed down by 2D rendering (think especially of the case of EXA fallbacks, which will sync your fifo). In that case, ensuring a no-fallback EXA would become relevant to achieving smooth 3D performance. But at least it solves the issue of sluggish OpenGL window moves and resizes (/me looks at the nvidia binary driver behaviour).

Stephane

I likely had a problem with my mail, as I think my previous mail didn't get through. Anyway, I am all for having the X server responsible for swapping buffers. To solve part of the above problem we might have two contexts (fifos) for the X server: one for X drawing, one for X buffer swaps. The swap context (fifo) is a top-priority thing and should preempt the other contexts (fifos). An outcome of this is that we might want a simple GPU scheduler for such cases (and maybe others in the future), but obviously such a scheduler will be highly hardware dependent.

Cheers,
Jerome Glisse
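Jerome's two-fifo idea can be sketched as a toy scheduler. Everything here is hypothetical (`struct fifo`, `schedule_next` are invented names), and as he notes, a real implementation would be highly hardware dependent; the only point illustrated is the policy that the swap fifo always preempts the drawing fifo:

```c
#include <assert.h>

/* A trivial command fifo; no wraparound, sized for illustration only. */
#define FIFO_DEPTH 8

struct fifo { int cmds[FIFO_DEPTH]; int head, tail; };

static int  fifo_empty(const struct fifo *f) { return f->head == f->tail; }
static void fifo_push(struct fifo *f, int cmd) { f->cmds[f->tail++] = cmd; }
static int  fifo_pop(struct fifo *f)           { return f->cmds[f->head++]; }

/* Strict-priority pick: drain the swap fifo before the drawing fifo.
 * Returns 1 and stores the command in *cmd, or 0 if both are empty. */
static int schedule_next(struct fifo *swap, struct fifo *draw, int *cmd)
{
    if (!fifo_empty(swap)) { *cmd = fifo_pop(swap); return 1; }
    if (!fifo_empty(draw)) { *cmd = fifo_pop(draw); return 1; }
    return 0;
}
```

With this policy, a pending buffer swap is never stuck behind a pile of queued 2D rendering, which is exactly the latency concern raised above.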
Re: DRI2 and lock-less operation
Michel Dänzer wrote:

On Tue, 2007-11-27 at 15:44 -0500, Kristian Høgsberg wrote:

On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote:

... In both cases, the kernel will need to know how to activate a given context, and the context handle should be part of the super-ioctl arguments.

We needn't expose the contexts to user space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server, where a context per client may be useful).

Oh right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd?

That's not supported, every GLX context has its own DRM context.

It needs to be. DRI applications should have a single fd open to the drm, otherwise shared buffer operation will be problematic.

/Thomas
Re: DRI2 and lock-less operation
On Nov 26, 2007 6:51 PM, Jerome Glisse [EMAIL PROTECTED] wrote:

... Sounds like this will all work out after all :) Kristian

Yes. Meanwhile I started looking at your dri2 branch, but didn't find a dri2 branch in your intel ddx repository - did you push the code anywhere else? It would help me to see what I need to do for dri2 in the ddx.

I didn't push that yet, but I've attached the patch as it looks right now. It's still very much work-in-progress.

Kristian

commit 3a16f83c278902ff673bc21272a3b95eb6237dab
Author: Kristian Høgsberg [EMAIL PROTECTED]
Date:   Mon Nov 19 17:31:02 2007 -0500

    dri2 wip

diff --git a/src/i830.h b/src/i830.h
index 57f0544..ba20df8 100644
--- a/src/i830.h
+++ b/src/i830.h
@@ -295,6 +295,12 @@ enum last_3d {
     LAST_3D_ROTATION
 };

+enum dri_type {
+    DRI_TYPE_NONE,
+    DRI_TYPE_XF86DRI,
+    DRI_TYPE_DRI2
+};
+
 typedef struct _I830Rec {
     unsigned char *MMIOBase;
     unsigned char *GTTBase;
@@ -456,7 +462,7 @@ typedef struct _I830Rec {
     CARD32 samplerstate[6];

     Bool directRenderingDisabled;	/* DRI disabled in PreInit. */
-    Bool directRenderingEnabled;	/* DRI enabled this generation. */
+    enum dri_type directRendering;	/* Type of DRI enabled this generation. */

 #ifdef XF86DRI
     Bool directRenderingOpen;
diff --git a/src/i830_accel.c b/src/i830_accel.c
index 4d9ea79..e38897e 100644
--- a/src/i830_accel.c
+++ b/src/i830_accel.c
@@ -136,7 +136,7 @@ I830WaitLpRing(ScrnInfoPtr pScrn, int n, int timeout_millis)
 	i830_dump_error_state(pScrn);
 	ErrorF("space: %d wanted %d\n", ring->space, n);
 #ifdef XF86DRI
-	if (pI830->directRenderingEnabled) {
+	if (pI830->directRendering == DRI_TYPE_XF86DRI) {
 	    DRIUnlock(screenInfo.screens[pScrn->scrnIndex]);
 	    DRICloseScreen(screenInfo.screens[pScrn->scrnIndex]);
 	}
@@ -176,7 +176,7 @@ I830Sync(ScrnInfoPtr pScrn)
 #ifdef XF86DRI
     /* VT switching tries to do this. */
-    if (!pI830->LockHeld && pI830->directRenderingEnabled) {
+    if (!pI830->LockHeld && pI830->directRendering != DRI_TYPE_NONE) {
 	return;
     }
 #endif
diff --git a/src/i830_display.c b/src/i830_display.c
index d988b86..8b07c51 100644
--- a/src/i830_display.c
+++ b/src/i830_display.c
@@ -408,7 +408,7 @@ i830PipeSetBase(xf86CrtcPtr crtc, int x, int y)
     }
 #ifdef XF86DRI
-    if (pI830->directRenderingEnabled) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI) {
 	drmI830Sarea *sPriv = (drmI830Sarea *) DRIGetSAREAPrivate(pScrn->pScreen);
 	if (!sPriv)
@@ -755,7 +755,7 @@ i830_crtc_dpms(xf86CrtcPtr crtc, int mode)
     intel_crtc->dpms_mode = mode;
 #ifdef XF86DRI
-    if (pI830->directRenderingEnabled) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI) {
 	drmI830Sarea *sPriv = (drmI830Sarea *) DRIGetSAREAPrivate(pScrn->pScreen);
 	Bool enabled = crtc->enabled && mode != DPMSModeOff;
diff --git a/src/i830_dri.c b/src/i830_dri.c
index 4d3458f..511b183 100644
--- a/src/i830_dri.c
+++ b/src/i830_dri.c
@@ -1236,6 +1236,10 @@ I830DRIInitBuffers(WindowPtr pWin, RegionPtr prgn, CARD32 index)
     i830MarkSync(pScrn);
 }

+/* FIXME: fix this for real */
+#define ALLOCATE_LOCAL(size) xalloc(size)
+#define DEALLOCATE_LOCAL(ptr) xfree(ptr)
+
 /* This routine is a modified form of XAADoBitBlt with the calls to
  * ScreenToScreenBitBlt built in. My routine has the prgnSrc as source
  * instead of destination. My origin is upside down so the ydir cases
@@ -1748,7 +1752,7 @@ I830DRISetVBlankInterrupt (ScrnInfoPtr pScrn, Bool on)
     if (!pI830->want_vblank_interrupts)
 	on = FALSE;
-    if (pI830->directRenderingEnabled && pI830->drmMinor >= 5) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI && pI830->drmMinor >= 5) {
 	if (on) {
 	    if (xf86_config->num_crtc > 1 && xf86_config->crtc[1]->enabled)
 		if (pI830->drmMinor >= 6)
@@ -1775,7 +1779,7 @@ I830DRILock(ScrnInfoPtr pScrn)
 {
     I830Ptr pI830 = I830PTR(pScrn);
-    if (pI830->directRenderingEnabled && !pI830->LockHeld) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI && !pI830->LockHeld) {
 	DRILock(screenInfo.screens[pScrn->scrnIndex], 0);
 	pI830->LockHeld = 1;
 	I830RefreshRing(pScrn);
@@ -1792,7 +1796,7 @@ I830DRIUnlock(ScrnInfoPtr pScrn)
 {
     I830Ptr pI830 = I830PTR(pScrn);
-    if (pI830->directRenderingEnabled && pI830->LockHeld) {
+    if (pI830->directRendering == DRI_TYPE_XF86DRI && pI830->LockHeld) {
 	DRIUnlock(screenInfo.screens[pScrn->scrnIndex]);
 	pI830->LockHeld = 0;
     }
diff --git a/src/i830_driver.c b/src/i830_driver.c
index eacaefc..08a5904 100644
--- a/src/i830_driver.c
+++ b/src/i830_driver.c
@@ -208,6 +208,10 @@ USE OR OTHER DEALINGS IN THE SOFTWARE.
 #endif
 #endif

+#ifdef DRI2
+#include "dri2.h"
+#endif
+
 #ifdef I830_USE_EXA
 const char *I830exaSymbols[] = {
     exaGetVersion,
@@ -1467,7 +1471,7 @@ I830PreInit(ScrnInfoPtr pScrn, int flags)
     pI830->directRenderingDisabled =
 	!xf86ReturnOptValBool(pI830->Options, OPTION_DRI, TRUE);
-#ifdef XF86DRI
+#if defined(XF86DRI) || defined(DRI2)
     if (!pI830->directRenderingDisabled) {
 	if (pI830->noAccel
Re: DRI2 and lock-less operation
On Wed, 2007-11-28 at 15:07 +0100, Thomas Hellström wrote: Michel Dänzer wrote: On Tue, 2007-11-27 at 15:44 -0500, Kristian Høgsberg wrote: On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote: ... In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful). Oh, right we don't need one per GLContext, just per DRI client, mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd? That's not supported, every GLX context has its own DRM context. It needs to be. DRI applications should have a single fd open to drm, otherwise shared buffer operation will be problematic. Yeah, I misremembered and thought there was a 1:1 correspondence between DRM contexts and fds. Sorry for the confusion. -- Earthling Michel Dänzer | http://tungstengraphics.com Libre software enthusiast | Debian, X and DRI developer - SF.Net email is sponsored by: The Future of Linux Business White Paper from Novell. From the desktop to the data center, Linux is going mainstream. Let it simplify your IT future. http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4 -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: DRI2 and lock-less operation
On Nov 28, 2007 6:15 AM, Stephane Marchesin [EMAIL PROTECTED] wrote: On 11/28/07, Keith Packard [EMAIL PROTECTED] wrote: On Wed, 2007-11-28 at 00:52 +0100, Stephane Marchesin wrote: The third case obviously will never require any kind of state re-emitting. Unless you run out of hardware contexts. Well, in that case we (plan to) bang the state registers from the kernel directly and do a manual state swap. So we still don't need state re-emitting. Well, banging the state registers from the kernel sounds like state emission to me - you mean userspace state emission won't be needed? How does the kernel know how to restore the state? Can you swap a previously set hardware state out to vram/hostmem? I guess you can always just read out the registers you need... Kristian
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote: On Nov 22, 2007 4:03 AM, Thomas Hellström [EMAIL PROTECTED] wrote: ... There are probably various ways to do this, which is another argument for keeping super-ioctls device-specific. For i915-type hardware, Dave's approach above is probably the most attractive one. For Poulsbo, all state is always implicitly included, usually as a reference to a buffer object, so we don't really bother about contexts here. For hardware like the Unichrome, the state is stored in a limited set of registers. Here the drm can keep a copy of those registers for each context and do a smart update on a context switch. However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible. The idea is to keep the cliprects in the kernel and only render double buffered. The only code that needs to worry about cliprects is swap buffer, either immediate or synced to vblank. What are the cliprect problems you mention? Kristian Hi, Kristian. Sorry for the late response. What I'm thinking about is the case where the swapbuffers needs to be done by the 3D engine, and with increasingly complex hardware this will at the very least mean some sort of pixel-shader code in the kernel, or perhaps loaded by the kernel firmware interface as a firmware snippet. If we take Poulsbo as an example, the 2D engine is present and open, and swapbuffers can be done using it, but since the 2D and 3D engines operate separately they are synced in software by the TTM memory manager fence class arbitration code, and we lose performance since we cannot pipeline 3D tasks as we'd want to. If the 3D engine were open, we'd still need a vast amount of code in the kernel to be able to just do a 3D blit. This is even more complicated by the fact that we may not be able to implement 3D blits in the kernel for IP protection reasons. Note that I'm just stating the problem here. I'm not arguing that this should influence the DRI2 design.
/Thomas
Re: DRI2 and lock-less operation
On 11/27/07, Thomas Hellström [EMAIL PROTECTED] wrote: Kristian Høgsberg wrote: On Nov 22, 2007 4:03 AM, Thomas Hellström [EMAIL PROTECTED] wrote: ... There are probably various ways to do this, which is another argument for keeping super-ioctls device-specific. For i915-type hardware, Dave's approach above is probably the most attractive one. For Poulsbo, all state is always implicitly included, usually as a reference to a buffer object, so we don't really bother about contexts here. For hardware like the Unichrome, the state is stored in a limited set of registers. Here the drm can keep a copy of those registers for each context and do a smart update on a context switch. However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible. The idea is to keep the cliprects in the kernel and only render double buffered. The only code that needs to worry about cliprects is swap buffer, either immediate or synced to vblank. What are the cliprect problems you mention? Kristian Hi, Kristian. Sorry for the late response. What I'm thinking about is the case where the swapbuffers needs to be done by the 3D engine, and with increasingly complex hardware this will at the very least mean some sort of pixel-shader code in the kernel, or perhaps loaded by the kernel firmware interface as a firmware snippet. If we take Poulsbo as an example, the 2D engine is present and open, and swapbuffers can be done using it, but since the 2D and 3D engines operate separately they are synced in software by the TTM memory manager fence class arbitration code, and we lose performance since we cannot pipeline 3D tasks as we'd want to. If the 3D engine were open, we'd still need a vast amount of code in the kernel to be able to just do a 3D blit. Then why don't you do it in user space? You could very well do swapbuffers in the DDX (and cliprects then become a non-issue).
This is even more complicated by the fact that we may not be able to implement 3D blits in the kernel for IP protection reasons. Note that I'm just stating the problem here. I'm not arguing that this should influence the DRI2 design. Yes, I don't think we should base the open source DRI design upon specs that have to be kept hidden. Especially if that hardware is not relevant in any way technically. Stephane
Re: DRI2 and lock-less operation
On 11/27/07, Keith Packard [EMAIL PROTECTED] wrote: On Mon, 2007-11-26 at 17:15 -0500, Kristian Høgsberg wrote: -full state I assume you'll deal with hardware which supports multiple contexts and avoid the need to include full state with each buffer? That's what we do already for nouveau, and there are no issues implementing it. Really that's driver-dependent, like the radeon atom emission mechanism is. Stephane
Re: DRI2 and lock-less operation
On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: Hi all, I've been working on the DRI2 implementation recently and I'm starting to get a little confused, so I figured I'd throw a couple of questions out to the list. First off, I wrote up this summary shortly after XDS http://wiki.x.org/wiki/DRI2 which upon re-reading is still pretty much up to date with what I'm trying to do. The buzzword summary from the page has
* Lockless
* Always private back buffers
* No clip rects in DRI driver
* No DDX driver part
* Minimal X server part
* Swap buffer and clip rects in kernel
* No SAREA
I've implemented the DRI2 xserver module (http://cgit.freedesktop.org/~krh/xserver/log/?h=dri2) and the new drm ioctls that it uses (http://cgit.freedesktop.org/~krh/drm/log/?h=dri2). I did the DDX part for the intel driver and DRI2 initialization consists of doing drmOpen (this is now up to the DDX driver), initializing the memory manager and using it to allocate stuff, and then calling DRI2ScreenInit(), passing in pScreen and the file descriptor. Basically, all of i830_dri.c isn't used in this mode. It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl and the X server now can push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is, is that feasible? How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM? I'm trying to figure out how context switches actually work... the DRI lock is overloaded as context switcher, and there is code in the kernel to call out to a chipset specific context switch routine when the DRI lock is taken... but only ffb uses it...
So I'm guessing the way context switches work today is that the DRI driver grabs the lock and after potentially updating the cliprects through X protocol, it emits all the state it depends on to the card. Is the state emission done by just writing out a bunch of registers? Is this how the X server works too, except XAA/EXA acceleration doesn't depend on a lot of state and thus the DDX driver can emit everything for each operation? How would this work if we didn't have a lock? You can't emit the state and then start rendering without a lock to keep the state in place... If the kernel doesn't restore any state, what's the point of the drm_context_t we pass to the kernel in drmLock? Should the kernel know how to restore state (this ties in to the email from jglisse on state tracking in drm and all the gallium jazz, I guess)? How do we identify state to the kernel, and how do we pass it in in the super-ioctl? Can we add a list of registers to be written and the values? I talked to Dave about it and we agreed that adding a drm_context_t to the super-ioctl would work, but now I'm thinking if the kernel doesn't track any state it won't really work. Maybe I seem to recall Eric Anholt did optimize the emission of radeon state atoms a long time ago, and he got some speed improvements. You'd have to ask him how much though. This could give us a rough idea of the performance impact of emitting full state vs needed state, although this doesn't measure the gain of removing lock contention. I might be totally wrong on this, this dates back to a couple of years ago. cross-client state sharing isn't useful for performance as Keith and Roland argue, but if the kernel doesn't restore state when it sees a super-ioctl coming from a different context, who does? Yes it probably has to go to the kernel anyway. Stephane
Re: DRI2 and lock-less operation
On Nov 26, 2007 11:15 PM, Keith Packard [EMAIL PROTECTED] wrote: On Mon, 2007-11-26 at 17:15 -0500, Kristian Høgsberg wrote: -full state I assume you'll deal with hardware which supports multiple contexts and avoid the need to include full state with each buffer? As for hardware contexts, I guess there are two cases; hardware that has a fixed set of contexts builtin and hardware where a context is just a piece of video memory that you can point to (effectively an unlimited number of contexts). In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. In the case of unlimited contexts, that may be all that's needed. In the case of a fixed number of contexts, you will need to be able to restore state when you have more software contexts in use than you have hardware contexts. I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware... But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort. cheers, Kristian
Re: DRI2 and lock-less operation
On Tue, 2007-11-27 at 14:03 -0500, Kristian Høgsberg wrote: As for hardware contexts, I guess there are two cases; hardware that has a fixed set of contexts builtin and hardware where a context is just a piece of video memory that you can point to (effectively an unlimited number of contexts). Yeah, intel hardware uses memory and has no intrinsic limit to the number of contexts; they just need to be pinned in GTT space and given a unique address. In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful). I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware... I recall having the same discussion in the past; having the superioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware. All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'. But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort.
We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that. -- [EMAIL PROTECTED]
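The flag-and-error contract Keith describes above could be sketched as follows. This is a userspace simulation only, not real DRM code; every name in it (fake_superioctl, EXEC_FLAG_REINIT_STATE, the choice of ENXIO as the "lost context" errno) is hypothetical:

```c
#include <errno.h>

/* Hypothetical flag: "this submission reinitializes the hardware". */
#define EXEC_FLAG_REINIT_STATE 0x1
/* Hypothetical "lost context" error; ENXIO is an arbitrary stand-in. */
#define ELOSTCTX ENXIO

struct fake_superioctl_args {
    unsigned int context; /* per-client context handle */
    unsigned int flags;   /* EXEC_FLAG_REINIT_STATE, ... */
    /* batchbuffer handle, relocations, etc. would follow */
};

static unsigned int hw_current_context; /* context last active on the GPU */

/* Kernel-side decision: if the submitting context is not the one currently
 * on the hardware and the client did not flag a state-reinit preamble,
 * fail so the client can resubmit with one. */
int fake_superioctl(struct fake_superioctl_args *args)
{
    if (args->context != hw_current_context &&
        !(args->flags & EXEC_FLAG_REINIT_STATE))
        return -ELOSTCTX;

    hw_current_context = args->context; /* this submission owns the hw now */
    return 0;
}
```

The point of the sketch is only the control flow: the kernel never restores state itself, it merely tells the client when state must be re-established.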
Re: DRI2 and lock-less operation
On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote: ... In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful). Oh, right we don't need one per GLContext, just per DRI client, mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd? I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware... I recall having the same discussion in the past; having the superioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware. All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'. Exactly. But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort. We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support.
The drivers already have the code to emit state in case of context loss, that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work? Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that. Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported? Kristian
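The resubmit loop Kristian sketches above (submit, and on "lost context" prepend a state-emission preamble and try again) could look roughly like this. It is a self-contained simulation under invented names (fake_submit, emit_full_state_preamble); the real code would call the chipset's superioctl:

```c
#include <errno.h>
#include <stddef.h>

/* Simulated kernel: fails a preamble-less submission while the context is
 * lost, which is the hypothetical "lost context" error from the thread. */
static int context_lost = 1;

static int fake_submit(const char *preamble, const char *batch)
{
    (void)batch;
    if (context_lost && preamble == NULL)
        return -ENXIO; /* stand-in "lost context" error */
    context_lost = 0;  /* state is now (re)established on the hardware */
    return 0;
}

/* Placeholder for the context-loss state emission code the drivers already
 * have, repurposed to fill a batch buffer instead of the ring. */
static const char *emit_full_state_preamble(void)
{
    return "STATE";
}

/* Client-side loop: the rendering batch itself never changes; only the
 * preamble is generated and prepended when the first attempt fails. */
int submit_rendering(const char *batch)
{
    int ret = fake_submit(NULL, batch);
    if (ret == -ENXIO)
        ret = fake_submit(emit_full_state_preamble(), batch);
    return ret;
}
```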
Re: DRI2 and lock-less operation
On Nov 27, 2007 10:41 AM, Thomas Hellström [EMAIL PROTECTED] wrote: ... However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible. The idea is to keep the cliprects in the kernel and only render double buffered. The only code that needs to worry about cliprects is swap buffer, either immediate or synced to vblank. What are the cliprect problems you mention? Kristian Hi, Kristian. Sorry for the late response. What I'm thinking about is the case where the swapbuffers needs to be done by the 3D engine, and with increasingly complex hardware this will at the very least mean some sort of pixel-shader code in the kernel, or perhaps loaded by the kernel firmware interface as a firmware snippet. And the kernel will have to somehow communicate the list of clip rects to this firmware/pixel-shader in one way or another, maybe fixing up the code or providing a tri-list buffer or something. I don't really know, but it already sounds too complicated for the kernel, in my opinion. Another problem is that it's not just swapbuffer - anything that touches the front buffer has to respect the cliprects - glCopyPixels and glXCopySubBufferMESA - and thus has to happen in the kernel. If we take Poulsbo as an example, the 2D engine is present and open, and swapbuffers can be done using it, but since the 2D and 3D engines operate separately they are synced in software by the TTM memory manager fence class arbitration code, and we lose performance since we cannot pipeline 3D tasks as we'd want to. If the 3D engine were open, we'd still need a vast amount of code in the kernel to be able to just do a 3D blit. This is even more complicated by the fact that we may not be able to implement 3D blits in the kernel for IP protection reasons. Note that I'm just stating the problem here. I'm not arguing that this should influence the DRI2 design.
I understand, but I'm starting to doubt the cliprects and swapbuffer in the kernel design anyhow. I wasn't going this route originally, but since we already had that in place for i915 vblank buffer swaps, I figured we might as well go that route. I'm not sure the buffer swap from the vblank tasklet is really necessary to begin with... are there benchmarks showing that the irq-userspace latency was too big for this to work in userspace? My proposal back at XDS was to have a shared host memory ring buffer where the X server pushes cliprect changes and clients read it out from there, and I still think that's a nice design. In a lockless world, the superioctl arguments optionally include the buffer head (as a timestamp) so that the kernel can detect whether a swap buffer batchbuffer is stale. If it is, meaning the X server published cliprect changes, the submit fails and the client must recompute the batchbuffer for the swap command and resubmit after parsing the new cliprects. When rendering to a private back buffer, clip rects aren't relevant and so the superioctl won't have the buffer head field set and the kernel will never discard it. I dunno, maybe I'm just rambling... Kristian
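The head-as-timestamp scheme described above can be sketched in a few lines. This is an illustration with invented names (cliprect_ring, publish_cliprects, submit_swap), not any existing interface; the real ring would live in shared memory and carry the actual cliprect entries:

```c
#include <errno.h>

struct cliprect_ring {
    unsigned int head; /* bumped by the X server on every cliprect update */
    /* cliprect entries would follow in shared memory */
};

/* X server side: publish a cliprect change by advancing the head. */
void publish_cliprects(struct cliprect_ring *ring)
{
    ring->head++;
}

/* Kernel side: a swap batchbuffer built against snapshot_head is stale if
 * the server has published since; the client must reparse the ring,
 * recompute the swap batch, and resubmit. */
int submit_swap(const struct cliprect_ring *ring, unsigned int snapshot_head)
{
    if (snapshot_head != ring->head)
        return -EAGAIN; /* stale cliprects: recompute and resubmit */
    return 0;           /* cliprects unchanged, execute the swap */
}
```

Note this only detects staleness at submit time; whether the server can still race between the check and execution is exactly the question Stephane raises below in the thread.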
Re: DRI2 and lock-less operation
On Nov 27, 2007 11:48 AM, Stephane Marchesin [EMAIL PROTECTED] wrote: On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: ... It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl and the X server now can push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is, is that feasible? How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM? The idea was that the buffer swap happens in the kernel, triggered by an ioctl. The kernel generates the command stream to execute the swap against the current set of cliprects. The back buffers are always private so the cliprects only come into play when copying from the back buffer to the front buffer. Single buffered visuals are secretly double buffered and implemented the same way. I'm trying to figure out now whether it makes more sense to keep cliprects and swapbuffer out of the kernel, which wouldn't change the above much, except the swapbuffer case. I described the idea for swapbuffer in this case in my reply to Thomas: the X server publishes cliprects to the clients through a shared ring buffer, and clients parse the clip rect changes out of this buffer as they need it. When posting a swap buffer request, the buffer head should be included in the super-ioctl so that the kernel can reject stale requests. Kristian
Re: DRI2 and lock-less operation
In general the problem with the superioctl returning 'fail' is that the client has to then go back in time and figure out what the state preamble would have been at the start of the batchbuffer. Of course the easiest way to do this is to actually precompute the preamble at batchbuffer start time and store it in case the superioctl fails -- in which case, why not pass it to the kernel along with the rest of the batchbuffer and have the kernel decide whether or not to play it? Keith - Original Message From: Kristian Høgsberg [EMAIL PROTECTED] To: Keith Packard [EMAIL PROTECTED] Cc: Jerome Glisse [EMAIL PROTECTED]; Dave Airlie [EMAIL PROTECTED]; dri-devel@lists.sourceforge.net; Keith Whitwell [EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 8:44:48 PM Subject: Re: DRI2 and lock-less operation On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote: ... In both cases, the kernel will need to know how to activate a given context and the context handle should be part of the super ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful). Oh, right we don't need one per GLContext, just per DRI client, mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd? I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware... I recall having the same discussion in the past; having the superioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware.
All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'. Exactly. But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort. We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support. The drivers already have the code to emit state in case of context loss, that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work? Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that. Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported? Kristian
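Keith Whitwell's alternative at the top of this message (always precompute the preamble and hand both buffers to the kernel, which decides whether to play it) avoids the fail-and-resubmit round trip entirely. A sketch, again with invented names and a simulated kernel:

```c
#include <stddef.h>

/* Hypothetical submission carrying both the precomputed full-state
 * preamble and the rendering batch. */
struct fake_submission {
    unsigned int context;
    const char *preamble; /* precomputed at batchbuffer start time */
    const char *batch;    /* rendering commands, never recomputed */
};

static unsigned int last_context;  /* context last played on the hardware */
static int preambles_played;       /* for illustration: how often we paid */

/* Kernel side: play the client-supplied preamble only when the hardware
 * context actually changed since the last submission. */
void kernel_execute(const struct fake_submission *s)
{
    if (s->context != last_context) {
        /* context switch: replay the state preamble first */
        preambles_played++;
        last_context = s->context;
    }
    /* ... then execute s->batch ... */
}
```

The trade-off versus the fail-and-resubmit scheme is that the client always does the preamble computation, but never takes an extra ioctl round trip.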
Re: DRI2 and lock-less operation
On 11/27/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: On Nov 27, 2007 11:48 AM, Stephane Marchesin [EMAIL PROTECTED] wrote: On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: ... It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl and the X server now can push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is, is that feasible? How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM? The idea was that the buffer swap happens in the kernel, triggered by an ioctl. The kernel generates the command stream to execute the swap against the current set of cliprects. The back buffers are always private so the cliprects only come into play when copying from the back buffer to the front buffer. Single buffered visuals are secretly double buffered and implemented the same way. What if cliprects change between the time you emit them to the fifo and the time the blit gets executed by the card? Do you sync to the card in-drm, effectively killing performance? I'm trying to figure out now whether it makes more sense to keep cliprects and swapbuffer out of the kernel, which wouldn't change the above much, except the swapbuffer case. I described the idea for swapbuffer in this case in my reply to Thomas: the X server publishes cliprects to the clients through a shared ring buffer, and clients parse the clip rect changes out of this buffer as they need it. When posting a swap buffer request, the buffer head should be included in the super-ioctl so that the kernel can reject stale requests.
When that happens, the client must parse the new cliprect info and resubmit an updated swap buffer request. I fail to see how this works with a lockless design. How do you ensure the X server doesn't change cliprects between the time it has written those in the shared ring buffer and the time the DRI client picks them up and has the command fired and actually executed? Do you lock out the server during that time? Stephane
Re: DRI2 and lock-less operation
On 11/27/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported?

Hmm, you can't be that binary about it. I think there are 3 classes of devices around:
- no context support at all: old radeon, old intel
- multiple fifos, no hw context switching: newer radeon, newer intel
- multiple fifos, hw context switching: all nvidia
The third case obviously will never require any kind of state re-emitting. Stephane
Re: DRI2 and lock-less operation
On Tue, 2007-11-27 at 15:44 -0500, Kristian Høgsberg wrote: Oh, right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd?

For that, you'd either want 'switch context' ioctls, or context arguments with every request. The former is easy to retrofit, the latter would have to be done right now.

All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'. Exactly.

I think Keith's comment that knowing how to reinit is effectively the same as computing the reinitialization buffer may be relevant here. I'm not entirely sure this is true; it might be possible to save higher level information about the state than the ring contents. For Intel, I'm not planning on using this anyway, so I suspect Radeon will be the test case here.

We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support.

From what I can tell, all Intel chips support MI_SET_CONTEXT, with the possible exception of i830. I'm getting some additional docs for that chip to see what it does, but the i845 docs mention the 'new context support'; dunno if that's new as of i830 or i845...

The drivers already have the code to emit state in case of context loss; that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work?

With MI_SET_CONTEXT, you should never lose context, so we'd never need to be able to do this.

Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year).
Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported?

Hardware context support actually looks really easy to add; it's just a page in GTT per drm client, then an extra instruction in the ring when context switching is required. I'll have to see if the i830 supports it though. The current mechanism actually looks a lot harder to handle; I don't know why the driver didn't use MI_SET_CONTEXT from the very start. -- [EMAIL PROTECTED]
Re: DRI2 and lock-less operation
On Wed, 2007-11-28 at 00:52 +0100, Stephane Marchesin wrote: The third case obviously will never require any kind of state re-emitting. Unless you run out of hardware contexts. -- [EMAIL PROTECTED]
Re: DRI2 and lock-less operation
On Tue, Nov 27, 2007 at 03:44:48PM -0500, Kristian Høgsberg wrote: On Nov 27, 2007 2:30 PM, Keith Packard [EMAIL PROTECTED] wrote: ... In both cases, the kernel will need to know how to activate a given context, and the context handle should be part of the super-ioctl arguments. We needn't expose the contexts to user-space, just provide a virtual consistent device and manage contexts in the kernel. We could add the ability to manage contexts from user space for cases where that makes sense (like, perhaps, in the X server where a context per client may be useful).

Oh, right, we don't need one per GLContext, just per DRI client; mesa handles switching between GL contexts. What about multithreaded rendering sharing the same drm fd? I imagine one optimization you could do with a fixed number of contexts is to assume that losing the context will be very rare, and just fail the super-ioctl when it happens, and then expect user space to resubmit with a state emission preamble. In fact it may work well for single context hardware...

I recall having the same discussion in the past; having the superioctl fail so that the client needn't constantly compute the full state restore on every submission may be a performance win for some hardware. All that this requires is a flag to the kernel that says 'this submission reinitializes the hardware', and an error return that says 'lost context'.

Exactly. But the super-ioctl is chipset specific and we can decide on the details there on a chipset to chipset basis. If you have input on what the super-ioctl for intel should look like to support lockless operation for current and future intel chipsets, I'd love to hear it. And of course we can version our way out of trouble as a last resort.

We should do the lockless and context stuff at the same time; doing re-submit would be a bunch of work in the driver that would be wasted. Is it that bad? We will still need the resubmit code for older chipsets that don't have the hardware context support.
The drivers already have the code to emit state in case of context loss; that code just needs to be repurposed to generate a batch buffer to prepend to the rendering code. And the rendering code doesn't need to change when resubmitting. Will that work?

Right now, we're just trying to get 965 running with ttm; once that's limping along, figuring out how to make it reasonable will be the next task, and hardware context support is clearly a big part of that.

Yeah - I'm trying to limit the scope of DRI2 so that we can have redirected direct rendering and GLX 1.4 in the tree sooner rather than later (before the end of the year). Maybe the best way to do that is to keep the lock around for now and punt on the lock-less super-ioctl if that really blocks on hardware context support. How far back are hardware contexts supported? Kristian

Maybe just accept that only drivers converted to dri2 will be lockless, ie if you are dri2 you have the superioctl and other things like that; anyway, I believe what GLX 1.4 gives is a bit useless without a proper driver (TTM at least). Cheers, Jerome Glisse
Re: DRI2 and lock-less operation
On Wed, Nov 28, 2007 at 12:43:41AM +0100, Stephane Marchesin wrote: On 11/27/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: On Nov 27, 2007 11:48 AM, Stephane Marchesin [EMAIL PROTECTED] wrote: On 11/22/07, Kristian Høgsberg [EMAIL PROTECTED] wrote: ... It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note, the drawable lock is gone, we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl, and the X server can now push cliprects into the kernel in one atomic operation, we would be able to get rid of the DRI lock. My overall question here is: is that feasible?

How do you plan to ensure that X didn't change the cliprects after you emitted them to the DRM?

The idea was that the buffer swap happens in the kernel, triggered by an ioctl. The kernel generates the command stream to execute the swap against the current set of cliprects. The back buffers are always private, so the cliprects only come into play when copying from the back buffer to the front buffer. Single buffered visuals are secretly double buffered and implemented the same way.

What if cliprects change between the time you emit them to the fifo and the time the blit gets executed by the card? Do you sync to the card in-drm, effectively killing performance?

I'm now trying to figure out whether it makes more sense to keep cliprects and swapbuffer out of the kernel, which wouldn't change the above much, except for the swapbuffer case. I described the idea for swapbuffer in this case in my reply to Thomas: the X server publishes cliprects to the clients through a shared ring buffer, and clients parse the cliprect changes out of this buffer as they need them. When posting a swap buffer request, the buffer head should be included in the super-ioctl so that the kernel can reject stale requests.
When that happens, the client must parse the new cliprect info and resubmit an updated swap buffer request.

I fail to see how this works with a lockless design. How do you ensure the X server doesn't change cliprects between the time it has written those in the shared ring buffer and the time the DRI client picks them up and has the command fired and actually executed? Do you lock out the server during that time? Stephane

All this is starting to confuse me a bit :) So, if I am right, all clients render to private buffers and then you want to blit those into the front buffer. In the composite world it's up to the compositor to do this, and so you don't have to care about cliprects. I think we can have some kind of dumb default compositor that would handle this blit directly in the X server. And this compositor might even not use cliprects. For window|pixmap|... resizing, can the following scheme fit:

Single buffered path (ie no back buffer, but the client still renders to a private buffer):
1) X gets the resize event
2) X asks drm to allocate a new drawable with the new size
3) X enqueues a query to drm which copies the current buffer content into the new one and also updates the drawable so further rendering requests will happen in the new buffer
4) X starts using the new buffer when compositing into the scanout buffer
So you might see rendering artifacts, but I guess this has to be expected in single buffered applications where a size change can happen at any time.

Double buffered path:
1) X gets the resize event
2) X allocates a new front buffer and blits the old front buffer into it (I might be wrong and X might not need to preserve content on resize) so X can start blitting using the new buffer size but with the old content.
X allocates a new back buffer
3) X enqueues a query to drm to change the drawable's back buffer
If there is no pending rendering on the current back buffer (old size):
4) drm updates the drawable so the back buffer is the new one (with the new size)
Finished with resizing
If there is pending rendering on the current back buffer (old size):
4) On the next swapbuffer, X blits the back buffer into the front buffer (if there is a need to preserve content), deallocates the back buffer and allocates a new one with the new size (so the front buffer stays the same, just a part of it is updated)
Finished with resizing

So this follows the comment Keith made on the wiki about resizing; I think it's the best strategy, even though if redrawing the window is slow one might never see proper content while constantly resizing the window (but we can't do much against a broken user ;)) Of course in this scheme you do the blitting from the private client buffer (whether it's the front buffer or the only buffer for single buffered apps) in client space, ie in the X server (so no need for cliprects in the kernel). Note that you blit as soon as step 3 finishes for single buffering and as soon as step 2 finishes for double buffering. Does
Re: DRI2 and lock-less operation
On Nov 22, 2007 4:23 AM, Keith Whitwell [EMAIL PROTECTED] wrote: ... My guess for one way is to store a buffer object with the current state emission in it, and submit it with the superioctl maybe, and if we have lost context emit it before the batchbuffer..

The way drivers actually work at the moment is to emit a full state as a preamble to each batchbuffer. Depending on the hardware, this can be pretty low overhead, and it seems that the trend in hardware is to make this operation cheaper and cheaper. This works fine without the lock. There is another complementary trend to support, one way or another, multiple hardware contexts (obviously nvidia have done this for years), meaning that the hardware effectively does the context switches. This is probably how most cards will end up working in the future, if not already. Neither of these needs a lock for detecting context switches.

Sure enough, but the problem is that without the lock userspace can't say "oops, I lost the context, let me prepend this state emission preamble to the batchbuffer" in a race-free way. If we want conditional state emission, we need to make that decision in the kernel. For example, the super-ioctl could send the state emission code as a separate buffer and also include the expected context handle. This lets the kernel compare the context handle supplied in the super-ioctl with the most recently active context handle, and if they differ, the kernel queues the state emission buffer first and then the rendering buffer. If the context handles match, the kernel just queues the rendering batch buffer. However, this means that user space must prepare the state emission code for each submission, whether or not it will actually be used. I'm not sure if this is too much overhead or if it's negligible? Kristian
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote: On Nov 22, 2007 4:23 AM, Keith Whitwell [EMAIL PROTECTED] wrote: ... My guess for one way is to store a buffer object with the current state emission in it, and submit it with the superioctl maybe, and if we have lost context emit it before the batchbuffer..

The way drivers actually work at the moment is to emit a full state as a preamble to each batchbuffer. Depending on the hardware, this can be pretty low overhead, and it seems that the trend in hardware is to make this operation cheaper and cheaper. This works fine without the lock. There is another complementary trend to support, one way or another, multiple hardware contexts (obviously nvidia have done this for years), meaning that the hardware effectively does the context switches. This is probably how most cards will end up working in the future, if not already. Neither of these needs a lock for detecting context switches.

Sure enough, but the problem is that without the lock userspace can't say "oops, I lost the context, let me prepend this state emission preamble to the batchbuffer" in a race-free way. If we want conditional state emission, we need to make that decision in the kernel.

The cases I describe above don't try to do this, but if you really wanted to, the way to do it would be to have userspace always emit the preamble but pass two offsets to the kernel, one at the start of the preamble, the other after it. Then the kernel can choose. I don't think there's a great deal to be gained from this optimization, though.

For example, the super-ioctl could send the state emission code as a separate buffer and also include the expected context handle. This lets the kernel compare the context handle supplied in the super-ioctl with the most recently active context handle, and if they differ, the kernel queues the state emission buffer first and then the rendering buffer. If the context handles match, the kernel just queues the rendering batch buffer.
However, this means that user space must prepare the state emission code for each submission, whether or not it will actually be used. I'm not sure if this is too much overhead or if it's negligible?

I think both preparing it on CPU and executing it on GPU are likely to be pretty negligible, but some experimentation on a system with just a single app running should show this quickly one way or another. Keith
Re: DRI2 and lock-less operation
On Nov 22, 2007 5:37 AM, Jerome Glisse [EMAIL PROTECTED] wrote: ... I will go this way too for r300/r400/r500; there are not so many register changes with different contexts, and registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, and pitch/tile information on these 2 buffers; this information will be embedded in drm_drawable, as will the cliprects if I am right :)). It will be up to the client to reemit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep hw in a given state.

Are you suggesting we emit the state for every batchbuffer submission? As I wrote in my reply to Keith, without a lock you can't check that the state is what you expect and then submit a batchbuffer from user space. The check has to be done in the kernel, and the kernel will then emit the state conditionally. And even if this scheme can eliminate unnecessary state emission, the state still has to be passed to the kernel with every batch buffer submission, in case the kernel needs to emit it.

So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

So you do need a lock? Could the ring buffer management be done in the drm by the super-ioctl code, and would that eliminate the need for a sarea? Kristian
Re: DRI2 and lock-less operation
On Nov 22, 2007 4:03 AM, Thomas Hellström [EMAIL PROTECTED] wrote: ... There are probably various ways to do this, which is another argument for keeping super-ioctls device-specific. For i915-type hardware, Dave's approach above is probably the most attractive one. For Poulsbo, all state is always implicitly included, usually as a reference to a buffer object, so we don't really bother about contexts here. For hardware like the Unichrome, the state is stored in a limited set of registers. Here the drm can keep a copy of those registers for each context and do a smart update on a context switch. However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible.

The idea is to keep the cliprects in the kernel and only render double buffered. The only code that needs to worry about cliprects is swap buffer, either immediate or synced to vblank. What are the cliprect problems you mention? Kristian
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote: On Nov 22, 2007 5:37 AM, Jerome Glisse [EMAIL PROTECTED] wrote: ... I will go this way too for r300/r400/r500; there are not so many register changes with different contexts, and registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, and pitch/tile information on these 2 buffers; this information will be embedded in drm_drawable, as will the cliprects if I am right :)). It will be up to the client to reemit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep hw in a given state.

Are you suggesting we emit the state for every batchbuffer submission? As I wrote in my reply to Keith, without a lock you can't check that the state is what you expect and then submit a batchbuffer from user space. The check has to be done in the kernel, and the kernel will then emit the state conditionally. And even if this scheme can eliminate unnecessary state emission, the state still has to be passed to the kernel with every batch buffer submission, in case the kernel needs to emit it.

I didn't explain myself properly; I meant that it's up to the client to reemit all state in each call to the superioctl, so there will be full state emission, but I won't check it, ie if it doesn't emit full state then userspace can just expect buggy rendering. Meanwhile there are a few things I don't want userspace to mess with, as they likely need special treatment, like the offset in ram where 3d rendering is going to happen, where the ancillary buffers are, ... I expect to have all this information attached to a drm_drawable (rendering buffer, ancillary buffer, ...), so each call of the superioctl will have this:
- drm_drawable (where rendering will be)
- full state
- additional buffers needed for rendering (vertex buffer, textures, ...)
So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

So you do need a lock? Could the ring buffer management be done in the drm by the super-ioctl code, and would that eliminate the need for a sarea?

This ring lock is for internal use only. If one client is in the superioctl and another is emitting a fence, both need to write to the ring but through different paths; to avoid any userspace lock I have a kernel lock which any function writing to the ring needs to take. As writes to the ring should be short, this shouldn't hurt. So no, I don't need a lock and I don't want a lock :) Hope I expressed myself better :) Cheers, Jerome Glisse
Re: DRI2 and lock-less operation
On Nov 26, 2007 3:40 PM, Jerome Glisse [EMAIL PROTECTED] wrote: Kristian Høgsberg wrote: On Nov 22, 2007 5:37 AM, Jerome Glisse [EMAIL PROTECTED] wrote: ... I will go this way too for r300/r400/r500; there are not so many register changes with different contexts, and registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, and pitch/tile information on these 2 buffers; this information will be embedded in drm_drawable, as will the cliprects if I am right :)). It will be up to the client to reemit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep hw in a given state.

Are you suggesting we emit the state for every batchbuffer submission? As I wrote in my reply to Keith, without a lock you can't check that the state is what you expect and then submit a batchbuffer from user space. The check has to be done in the kernel, and the kernel will then emit the state conditionally. And even if this scheme can eliminate unnecessary state emission, the state still has to be passed to the kernel with every batch buffer submission, in case the kernel needs to emit it.

I didn't explain myself properly; I meant that it's up to the client to reemit all state in each call to the superioctl, so there will be full state emission, but I won't check it, ie if it doesn't emit full state then userspace can just expect buggy rendering. Meanwhile there are a few things I don't want userspace to mess with, as they likely need special treatment, like the offset in ram where 3d rendering is going to happen, where the ancillary buffers are, ... I expect to have all this information attached to a drm_drawable (rendering buffer, ancillary buffer, ...), so each call of the superioctl will have this:
- drm_drawable (where rendering will be)
- full state
- additional buffers needed for rendering (vertex buffer, textures, ...)

Great, thanks for clarifying. Sounds good.
So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

So you do need a lock? Could the ring buffer management be done in the drm by the super-ioctl code, and would that eliminate the need for a sarea?

This ring lock is for internal use only. If one client is in the superioctl and another is emitting a fence, both need to write to the ring but through different paths; to avoid any userspace lock I have a kernel lock which any function writing to the ring needs to take. As writes to the ring should be short, this shouldn't hurt. So no, I don't need a lock and I don't want a lock :) Hope I expressed myself better :)

Hehe, I guess I've been a bit unclear too: what I want to make sure we can get rid of is the *userspace* lock that's currently shared between the X server and the DRI clients. What I hear you saying above is that you still need a kernel side lock to synchronize access to the ring buffer, which of course is required. The ring buffer and the lock that protects it live in the kernel, and user space can't access them directly. When emitting batchbuffers or fences, the ioctl() will need to take the lock, but will always release it before returning back to userspace. Sounds like this will all work out after all :) Kristian
Re: DRI2 and lock-less operation
Kristian Høgsberg wrote: On Nov 26, 2007 3:40 PM, Jerome Glisse [EMAIL PROTECTED] wrote: Kristian Høgsberg wrote: On Nov 22, 2007 5:37 AM, Jerome Glisse [EMAIL PROTECTED] wrote: ... I will go this way too for r300/r400/r500; there are not so many register changes with different contexts, and registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, and pitch/tile information on these 2 buffers; this information will be embedded in drm_drawable, as will the cliprects if I am right :)). It will be up to the client to reemit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep hw in a given state.

Are you suggesting we emit the state for every batchbuffer submission? As I wrote in my reply to Keith, without a lock you can't check that the state is what you expect and then submit a batchbuffer from user space. The check has to be done in the kernel, and the kernel will then emit the state conditionally. And even if this scheme can eliminate unnecessary state emission, the state still has to be passed to the kernel with every batch buffer submission, in case the kernel needs to emit it.

I didn't explain myself properly; I meant that it's up to the client to reemit all state in each call to the superioctl, so there will be full state emission, but I won't check it, ie if it doesn't emit full state then userspace can just expect buggy rendering. Meanwhile there are a few things I don't want userspace to mess with, as they likely need special treatment, like the offset in ram where 3d rendering is going to happen, where the ancillary buffers are, ... I expect to have all this information attached to a drm_drawable (rendering buffer, ancillary buffer, ...), so each call of the superioctl will have this:
- drm_drawable (where rendering will be)
- full state
- additional buffers needed for rendering (vertex buffer, textures, ...)
Great, thanks for clarifying. Sounds good.

So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

So you do need a lock? Could the ring buffer management be done in the drm by the super-ioctl code, and would that eliminate the need for a sarea?

This ring lock is for internal use only. If one client is in the superioctl and another is emitting a fence, both need to write to the ring but through different paths; to avoid any userspace lock I have a kernel lock which any function writing to the ring needs to take. As writes to the ring should be short, this shouldn't hurt. So no, I don't need a lock and I don't want a lock :) Hope I expressed myself better :)

Hehe, I guess I've been a bit unclear too: what I want to make sure we can get rid of is the *userspace* lock that's currently shared between the X server and the DRI clients. What I hear you saying above is that you still need a kernel side lock to synchronize access to the ring buffer, which of course is required. The ring buffer and the lock that protects it live in the kernel, and user space can't access them directly. When emitting batchbuffers or fences, the ioctl() will need to take the lock, but will always release it before returning back to userspace. Sounds like this will all work out after all :) Kristian

Yes. By the way, I started looking for your dri2 branch but didn't find a dri2 branch in your intel ddx repository. Did you push the code anywhere else? It would help me to see what I need to do for dri2 in the ddx. Cheers, Jerome Glisse
Re: DRI2 and lock-less operation
On Mon, 2007-11-26 at 17:15 -0500, Kristian Høgsberg wrote: -full state I assume you'll deal with hardware which supports multiple contexts and avoid the need to include full state with each buffer? -- [EMAIL PROTECTED]
Re: DRI2 and lock-less operation
I'm trying to figure out how context switches actually work... the DRI lock is overloaded as a context switcher, and there is code in the kernel to call out to a chipset-specific context switch routine when the DRI lock is taken... but only ffb uses it... So I'm guessing the way context switches work today is that the DRI driver grabs the lock and, after potentially updating the cliprects through X protocol, emits all the state it depends on to the card. Is the state emission done by just writing out a bunch of registers? Is this how the X server works too, except that XAA/EXA acceleration doesn't depend on a lot of state, and thus the DDX driver can emit everything for each operation?

So yes, userspace notices the context has changed and just re-emits everything into the batchbuffer it is going to send. For the XAA/EXA stuff in Intel, at least, there is an invariant-state emission function that notices what the context was and what the last server 3D user was (EXA or Xv texturing) and just dumps the state into the batchbuffer (or, currently, into the ring).

How would this work if we didn't have a lock? You can't emit the state and then start rendering without a lock to keep the state in place... If the kernel doesn't restore any state, what's the point of the drm_context_t we pass to the kernel in drmLock? Should the kernel know how to restore state (this ties in to the email from jglisse on state tracking in drm and all the gallium jazz, I guess)? How do we identify state to the kernel, and how do we pass it in in the super-ioctl? Can we add a list of registers to be written and the values? I talked to Dave about it and we agreed that adding a drm_context_t to the super-ioctl would work, but now I'm thinking that if the kernel doesn't track any state it won't really work. Maybe cross-client state sharing isn't useful for performance, as Keith and Roland argue, but if the kernel doesn't restore state when it sees a super-ioctl coming from a different context, who does?
My guess at one way is to store a buffer object with the current state emission in it and submit it with the superioctl; then, if we have lost the context, emit it before the batchbuffer.

Dave.
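Dave's resubmit idea can be sketched as follows. All of the names here are hypothetical stand-ins, not real drm API: `submit_batch()` models a chipset-specific superioctl, and `ELOSTCTX` models a "context was lost" error code the kernel would return so the client knows to resubmit with its state buffer prepended.

```c
#include <assert.h>

#define ELOSTCTX 1000   /* hypothetical "context was lost" error */

struct fake_dev { int current_ctx; };

/* Kernel side: reject the batch if another context ran since our last
 * submission, unless the batch carries its own state preamble. */
static int submit_batch(struct fake_dev *dev, int ctx, int has_preamble)
{
    if (dev->current_ctx != ctx && !has_preamble)
        return -ELOSTCTX;
    dev->current_ctx = ctx;   /* hardware now holds our state */
    return 0;
}

/* Client side: try the cheap path first, and prepend the stored
 * state-emission buffer only when the kernel reports a lost context. */
int submit_with_retry(struct fake_dev *dev, int ctx, int *resubmitted)
{
    int ret = submit_batch(dev, ctx, /*has_preamble=*/0);
    if (ret == -ELOSTCTX) {
        *resubmitted = 1;
        ret = submit_batch(dev, ctx, /*has_preamble=*/1);
    }
    return ret;
}
```

The attraction of this scheme is that the rendering batchbuffer itself never changes on resubmit; only the state buffer is glued on in front.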
Re: DRI2 and lock-less operation
Dave Airlie wrote: [Kristian's questions on context switches and Dave's reply quoted in full; snipped]

There are probably various ways to do this, which is another argument for keeping super-ioctls device-specific.

For i915-type hardware, Dave's approach above is probably the most attractive one. For Poulsbo, all state is always implicitly included, usually as a reference to a buffer object, so we don't really bother about contexts there. For hardware like the Unichrome, the state is stored in a limited set of registers; there the drm can keep a copy of those registers for each context and do a smart update on a context switch.

However, there are cases where it is very difficult to use cliprects from the drm, though I wouldn't say impossible.

/Thomas
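The Unichrome-style scheme Thomas describes, where the drm keeps a per-context shadow copy of the registers and writes only the ones that differ on a context switch, might look roughly like this. Every name here is made up for illustration; no real Unichrome register layout is implied.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 8

/* Modeled hardware state: the registers the chip currently holds,
 * plus a counter so we can see how many writes a switch costs. */
struct hw { uint32_t regs[NREGS]; int writes; };

/* Per-context shadow copy kept by the drm. */
struct ctx_state { uint32_t regs[NREGS]; };

/* On a context switch, compare the incoming context's shadow copy
 * with what the hardware holds and write only the registers that
 * actually differ. */
void restore_context(struct hw *hw, const struct ctx_state *ctx)
{
    for (int i = 0; i < NREGS; i++) {
        if (hw->regs[i] != ctx->regs[i]) {
            hw->regs[i] = ctx->regs[i];
            hw->writes++;
        }
    }
}
```

With a small, fixed register file this makes a context switch nearly free when two contexts share most of their state.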
Re: DRI2 and lock-less operation
Dave Airlie wrote: [Kristian's questions on context switches and Dave's reply quoted in full; snipped]

The way drivers actually work at the moment is to emit full state as a preamble to each batchbuffer. Depending on the hardware, this can be pretty low overhead, and the trend in hardware seems to be to make this operation cheaper and cheaper. This works fine without the lock.

There is another, complementary trend to support multiple hardware contexts in one way or another (obviously nvidia have done this for years), meaning that the hardware effectively does the context switches itself. This is probably how most cards will end up working in the future, if not already.

Neither of these needs a lock for detecting context switches.

Keith
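Keith's full-state-preamble approach amounts to the following: every submission unconditionally carries the driver's complete state emission in front of the rendering commands, so the batch is valid no matter which context ran last and no lock is needed. `build_submission()` is an illustrative helper, not a real driver function.

```c
#include <stdint.h>
#include <string.h>

/* Concatenate the full-state preamble and the rendering payload into
 * one self-contained submission; returns the total dword count. */
unsigned build_submission(uint32_t *out,
                          const uint32_t *preamble, unsigned npre,
                          const uint32_t *payload, unsigned npay)
{
    memcpy(out, preamble, npre * sizeof *out);          /* state first */
    memcpy(out + npre, payload, npay * sizeof *out);    /* then rendering */
    return npre + npay;
}
```

The cost is `npre` extra dwords per batch, which is exactly the overhead Keith notes is getting cheaper on newer hardware.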
Re: DRI2 and lock-less operation
Keith Whitwell wrote: [Keith's points about full-state preambles and multiple hardware contexts quoted in full; snipped]

I will go this way too. For r300/r400/r500 there are not that many register changes between different contexts, and the registers which need special treatment will be handled by the drm (this boils down to where 3d gets rendered and where the zbuffer is, plus pitch/tile information for these two buffers; this information will be embedded in the drm_drawable, as the cliprects are, if I am right :)). It will be up to the client to re-emit enough state for the card to be in good shape for its rendering, and I don't think it's worthwhile to provide facilities to keep the hw in a given state. So I don't need a lock, and indeed my actual code doesn't use any except for ring buffer emission (the only shared area among different clients I can see in my case).

Cheers,
Jerome Glisse
DRI2 and lock-less operation
Hi all,

I've been working on the DRI2 implementation recently and I'm starting to get a little confused, so I figured I'd throw a couple of questions out to the list.

First off, I wrote up this summary shortly after XD: http://wiki.x.org/wiki/DRI2 which, upon re-reading, is still pretty much up to date with what I'm trying to do. The buzzword summary from the page:

* Lockless
* Always private back buffers
* No clip rects in DRI driver
* No DDX driver part
* Minimal X server part
* Swap buffer and clip rects in kernel
* No SAREA

I've implemented the DRI2 xserver module (http://cgit.freedesktop.org/~krh/xserver/log/?h=dri2) and the new drm ioctls that it uses (http://cgit.freedesktop.org/~krh/drm/log/?h=dri2). I did the DDX part for the intel driver; DRI2 initialization consists of doing drmOpen (this is now up to the DDX driver), initializing the memory manager, using it to allocate stuff, and then calling DRI2ScreenInit(), passing in pScreen and the file descriptor. Basically, none of i830_dri.c is used in this mode.

It's all delightfully simple, but I'm starting to reconsider whether the lockless bullet point is realistic. Note that the drawable lock is gone; we always render to private back buffers and do swap buffers in the kernel, so I'm only concerned with the DRI lock here. The idea is that since we have the memory manager and the super-ioctl, and the X server can now push cliprects into the kernel in one atomic operation, we should be able to get rid of the DRI lock. My overall question here is: is that feasible?

I'm trying to figure out how context switches actually work... the DRI lock is overloaded as a context switcher, and there is code in the kernel to call out to a chipset-specific context switch routine when the DRI lock is taken... but only ffb uses it... So I'm guessing the way context switches work today is that the DRI driver grabs the lock and, after potentially updating the cliprects through X protocol, emits all the state it depends on to the card.

Is the state emission done by just writing out a bunch of registers? Is this how the X server works too, except that XAA/EXA acceleration doesn't depend on a lot of state, and thus the DDX driver can emit everything for each operation?

How would this work if we didn't have a lock? You can't emit the state and then start rendering without a lock to keep the state in place... If the kernel doesn't restore any state, what's the point of the drm_context_t we pass to the kernel in drmLock? Should the kernel know how to restore state (this ties in to the email from jglisse on state tracking in drm and all the gallium jazz, I guess)? How do we identify state to the kernel, and how do we pass it in in the super-ioctl? Can we add a list of registers to be written and the values? I talked to Dave about it and we agreed that adding a drm_context_t to the super-ioctl would work, but now I'm thinking that if the kernel doesn't track any state it won't really work. Maybe cross-client state sharing isn't useful for performance, as Keith and Roland argue, but if the kernel doesn't restore state when it sees a super-ioctl coming from a different context, who does?

Sorry for the question blitz, and I'm sure some of these sound a bit naive, as I'm not too familiar with the lower levels of many drivers. But if we're planning to move away from the DRI lock in the near future, we need to figure out what to do here. Or maybe we're not getting rid of the DRI lock anytime soon, but that would be a shame given that we've got everything else lined up.

cheers,
Kristian
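For what it's worth, a lockless super-ioctl argument block of the kind discussed in this thread might carry something like the following. Every field and flag name here is a hypothetical sketch, not taken from the real drm headers; the real layout would be chipset-specific, as noted above.

```c
#include <assert.h>
#include <stdint.h>

/* Batch begins with a full state-emission preamble, so the kernel may
 * accept it even after a context switch (hypothetical flag). */
#define SUBMIT_REEMITS_STATE  (1u << 0)

struct super_submit {
    uint32_t ctx_handle;     /* which (kernel-managed) context this is */
    uint32_t flags;          /* SUBMIT_REEMITS_STATE, ... */
    uint64_t batch_handle;   /* buffer object containing the commands */
    uint32_t batch_len;      /* bytes of valid commands in the batch */
    uint32_t num_cliprects;  /* cliprect snapshot, copied in atomically */
    uint64_t cliprects_ptr;  /* userspace pointer to the rect array */
};
```

Pointers and handles are kept 64-bit so the same layout works for 32-bit userspace on a 64-bit kernel; the cliprects travel with the submission, which is what lets the X server update them atomically without a shared lock.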