This mail is getting out of control... too many sub-threads going on... maybe we should break it out and talk about events vs kernel scheduling, and hold off on the patch review until we've figured something out.
2009/8/28 Thomas Hellström <tho...@shipmail.org>: > Kristian Høgsberg skrev: >> >> 2009/8/27 Thomas Hellström <tho...@shipmail.org>: >> >>> >>> Kristian Høgsberg skrev: >>> >>>> >>>> --- >>>> >>>> Okay, here's the patch again with the locking fixed. Everything else is >>>> the same. As for the potential problem with polling on the drm fd with >>>> both dri1 and dri2 enabled, I suggest that we just disable the dri1 >>>> sigio handler in the xserver. We'll do this in a commit that precedes >>>> the >>>> merge of the page flipping code, so that no server will be able to poll >>>> on the drm fd in both dri1 and dri2 code. >>>> >>>> cheers, >>>> Kristian >>>> >>>> >>> >>> <Grr> >>> >>> 1) The locking problems are still not fixed. See below. I just did a >>> quick look, so the comment list is not complete. >>> >> >> It's not as bad as you think. There is one glitch where I changed >> struct_mutex to mode_config.mutex and left dev->flip_list unprotected. >> The other issues you comment on should be fine, I've followed up in >> detail below. >> >> > > I actually think it is. See below. > >> >> >>> >>> </Grr> >>> >>> Even if that is fixed, I believe this patch adds a framework that >>> doesn't belong in a modern-day graphical subsystem for the following >>> reasons, many of them already mentioned, but let me reiterate. Although >>> the dri1 compatibility issue will be fixed with your suggestion. >>> >>> a) It puts requirements on user-space for correct functionality of DRM >>> ioctls. I'm talking about the master needing to parse the event queue. >>> >> >> Don't we already have a lot of requirements on userspace to use the >> DRM? It's not exactly trivial to prepare an execbuffer ioctl, or to >> find a connector, crtc and mode and set that, for example. Part of >> this work is a new libdrm entry point to read and parse the events from >> the fd. See the dri2-swapbuffer branch in drm.git.
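For illustration, the event stream on the fd could be consumed with a loop along these lines. The two-field header mirrors the generic struct drm_event this patch series adds to include/drm/drm.h (length covers the whole event, header included); parse_drm_events itself is a hypothetical helper, not existing libdrm API:

```c
#include <stddef.h>
#include <stdint.h>

/* Generic event header, as in the struct drm_event this patch adds
 * to include/drm/drm.h: 'length' is the size of the whole event,
 * header included, so differently-sized events share one stream. */
struct drm_event {
	uint32_t type;
	uint32_t length;
};

/* Walk a buffer returned by read(2) on the drm fd and hand each
 * complete event to 'handler'.  Returns the number of events seen.
 * Hypothetical helper for illustration, not libdrm API. */
int parse_drm_events(const void *buf, size_t len,
		     void (*handler)(const struct drm_event *))
{
	const char *p = buf;
	size_t i = 0;
	int count = 0;

	while (i + sizeof(struct drm_event) <= len) {
		const struct drm_event *ev =
			(const struct drm_event *)(p + i);

		if (ev->length < sizeof *ev || i + ev->length > len)
			break;		/* truncated or malformed event */
		handler(ev);
		i += ev->length;
		count++;
	}
	return count;
}
```

A master would call this on the buffer it gets from read(2) and dispatch on ev->type, e.g. to pick out flip-completion events.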
>> >> > > >>> b) It requires the master to act as a scheduler, and circumvents the DRM >>> command submission mechanism through the delayed unpin callback. If this >>> is to workaround any inability of GEM to serve a command submission >>> while a previous command submission is blocked in the kernel, then IMHO >>> that should be fixed and not worked around. >>> >> >> It's not about workarounds. Your suggestion *blocks the hw* while >> waiting for vsync. > > No it doesn't. It blocks dri clients when they try to render to old fronts. > Other dri clients would continue rendering. It provides a natural migration > path to triple buffering where automagically nothing is blocked, and also to > advanced software schedulers that can buffer command submissions instead of > blocking. You advocated blocking the command queue using a wait-for-vblank command at some point. > Now the concern about GEM was that if the kernel takes a global mutex > _before_ blocking a client and doesn't release that mutex when the client is > blocked, all rendering will naturally be blocked as a consequence. That's not how the patch works. We hold no locks while the client is blocked. > What my suggestion *does* is to block the X server if AIGLX tries to render > to the old front, and the command scheduler is simple. Now I'm not > completely sure how serious that is, given that AIGLX will, as stated > before, frequently block the X server anyway since it's not running in a > separate thread. Are you talking about an AIGLX client submitting an expensive shader locking up the X server for a long time? I'm not aware of any way AIGLX can block the server for up to 20ms at a time right now, but that is what it'll do if it blocks on vsync. >>> c) The implementation will mostly be worked around with capable >>> schedulers and hardware. >>> >> >> This is not about the capability of the hw or the sw. The design is >> deliberate. 
>> > > What I mean is that if I don't want the kernel code to do delayed unpins, > because my cs already handles that, and I don't want the X server to block > clients because my cs or hardware will handle that, I would do my very best > to work around this code. I don't see the difference between a delayed unpin and a fence. We're doing the same thing; why is it ok if we call it a fence? You were saying that we should make the vsync interrupt look like a sw command queue and use that to fence the scan-out buffer, right? Is that really better? I feel a bit like we're getting dragged back into the GEM vs TTM discussion here. I have no stake in that battle, I'm just trying to work with what's there. I don't think there's a fundamental problem with, or benefit to, either, and what you can do with a TTM fence you can do by waiting on the right GEM BO, as far as I understand. >>> A couple of questions: >>> Why are you guys so reluctant to use kernel scheduling for this instead >>> of a mix of kernel / user-space scheduling? >>> >> >> I'm not opposed to kernel scheduling as well, but userspace needs this >> event. I've made the case for AIGLX in the X server already, and >> direct rendering clients (for example, compositors) that want to >> handle input for as long as possible and avoid blocking have the same >> needs. >> > > I've still failed to understand this. Let's _assume_ for a while that the > kernel handles scheduling perfectly in a non-blocking fashion, or let's > assume we have triple-buffering, or let's assume we have a multi-FIFO card > that can do pageflipping and vblank barriers on all FIFOs. > > Why then exactly are events needed? And why are we required to track the > progress of the command fifo with events like jbarnes suggests, and finally > why is this mechanism not needed in the non-pageflipping case? If you can > give a typical use-case that would probably help a lot and avoid confusion. I think we have different applications in mind.
My guess is that you're thinking of a typical game workload that tries to render as many frames per second as possible. What I have in mind is an application like a compositor or other gui-type application, one that doesn't use the hw much and doesn't spend much time rendering but needs to respond to input events and requests from clients. What I'd like to know is, how do you design the main loop of that application so that it doesn't spend most of its time blocked in some gl entry point waiting for the swap to finish? It's not an application that's looking to queue up as many frames ahead as possible; that only introduces lag between the input events and what's on the screen. I'm not looking for "use threads", and I don't think there currently is a way to do this using just the OpenGL/GLX APIs. I agree that, aside from the case of AIGLX blocking the server, we don't need the event. But what I'd like to do is to add a GLX extension that lets applications add a file descriptor to their main loop and that way discover when the flip is done. That lets them stay in phase with the vsync, and provides a way to avoid spending most of their time blocked in an ioctl. And as I've said before, I'm not opposed to doing the scheduling in the kernel, as long as we also get the event, so applications have a chance of knowing when they might block. >>> If the plan is to eliminate DRI2GetBuffers() once per frame, what will >>> then be used to block clients rendering to the old back buffer? >> >> There'll be an event that's sent back after each DRI2SwapBuffer and >> the clients will block on receiving that event. We still need to send >> a request to the xserver and receive confirmation that the xserver has >> received it before we can render again. > > The above is to make sure the swap is scheduled before any continued > rendering, right? Yes. >> DRI2GetBuffers is a request >> that expects a reply and will block the client on the xserver when we >> call it.
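A minimal sketch of the kind of main loop Kristian describes — polling the drm (or GLX extension) fd next to the input fd so the application never blocks inside GL, and only repainting once the previous flip has completed. Everything here is illustrative: dispatch_once, handle_input, read_flip_event and repaint are hypothetical names, not drm or GLX API:

```c
#include <poll.h>
#include <stdbool.h>

/* Per-output state; names are illustrative, not drm API. */
struct loop_state {
	bool flip_pending;	/* a flip is scheduled, old buffer busy */
	bool damage;		/* something changed, a repaint is wanted */
};

/* One iteration of a compositor-style main loop: poll the flip-event
 * fd next to the input fd, stay responsive to input while a flip is
 * in flight, and only render a new frame once the previous flip's
 * completion event has arrived.  The three callbacks stand in for
 * real input handling, event reading, and render-plus-flip code. */
int dispatch_once(int input_fd, int drm_fd, struct loop_state *s,
		  void (*handle_input)(int),
		  void (*read_flip_event)(int),
		  void (*repaint)(void))
{
	struct pollfd fds[2] = {
		{ .fd = input_fd, .events = POLLIN },
		{ .fd = drm_fd,   .events = POLLIN },
	};

	if (poll(fds, 2, -1) < 0)
		return -1;

	if (fds[0].revents & POLLIN) {
		handle_input(input_fd);	/* never blocked on the swap */
		s->damage = true;
	}
	if (fds[1].revents & POLLIN) {
		read_flip_event(drm_fd);	/* flip done; buffer reusable */
		s->flip_pending = false;
	}
	if (s->damage && !s->flip_pending) {
		repaint();		/* draw and schedule the next flip */
		s->flip_pending = true;
		s->damage = false;
	}
	return 0;
}
```

The point of the structure is that input keeps being handled while flip_pending is set, and rendering stays in phase with vsync without ever sleeping in an ioctl.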
DRI2SwapBuffers is an async request, i.e. there's no reply and >> calling it won't necessarily block the client. We still have to wait >> for the new event before we can go on rendering, but doing it this way >> makes the client and server less tightly coupled. We may end up doing >> the roundtrip between client and server at a point where the client >> was going to block anyway (like disk i/o or something) saving a >> context switch. >> >> > > Hmm. I don't understand fully. So up to now, my picture of how a frame was > rendered looks like this. > > swapBuffers(); > if (check_for_needed_getbuffers()) > getbuffers(); > render(); > swapBuffers(); > > This is one X call per frame in the steady-state case. Now, where do you add > the dri2 pageflip throttling if we don't need to call getbuffers()? Is it in > check_for_needed_getbuffers()? I described this in more detail and hopefully more coherently in my email to Michel. If that's still not clear, follow up there. >>>> drivers/gpu/drm/drm_crtc.c | 171 >>>> ++++++++++++++++++++++++++++++- >>>> drivers/gpu/drm/drm_crtc_helper.c | 10 ++ >>>> drivers/gpu/drm/drm_drv.c | 1 + >>>> drivers/gpu/drm/drm_fops.c | 68 ++++++++++++- >>>> drivers/gpu/drm/drm_irq.c | 42 ++++++++ >>>> drivers/gpu/drm/i915/i915_drv.c | 1 + >>>> drivers/gpu/drm/i915/intel_display.c | 16 +++- >>>> drivers/gpu/drm/radeon/radeon_display.c | 3 +- >>>> include/drm/drm.h | 25 +++++ >>>> include/drm/drmP.h | 32 ++++++ >>>> include/drm/drm_crtc.h | 27 +++++ >>>> include/drm/drm_crtc_helper.h | 4 + >>>> include/drm/drm_mode.h | 16 +++ >>>> 13 files changed, 409 insertions(+), 7 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/drm_crtc.c b/drivers/gpu/drm/drm_crtc.c >>>> index 8fab789..d877c21 100644 >>>> --- a/drivers/gpu/drm/drm_crtc.c >>>> +++ b/drivers/gpu/drm/drm_crtc.c >>>> @@ -34,6 +34,8 @@ >>>> #include "drmP.h" >>>> #include "drm_crtc.h" >>>> >>>> +#undef set_base >>>> + >>>> struct drm_prop_enum_list { >>>> int type; >>>> char *name; >>>> @@ -342,6
+344,35 @@ void drm_framebuffer_cleanup(struct drm_framebuffer >>>> *fb) >>>> EXPORT_SYMBOL(drm_framebuffer_cleanup); >>>> >>>> /** >>>> + * drm_crtc_async_flip - do a set_base call from a work queue >>>> + * @work: work struct >>>> + * >>>> + * Called when a set_base call is queued by the page flip code. This >>>> + * allows the flip ioctl itself to return immediately and allow >>>> userspace >>>> + * to continue working. >>>> + */ >>>> +static void drm_crtc_async_flip(struct work_struct *work) >>>> +{ >>>> + struct drm_crtc *crtc = container_of(work, struct drm_crtc, >>>> async_flip); >>>> + struct drm_device *dev = crtc->dev; >>>> + struct drm_pending_flip *pending; >>>> + >>>> + mutex_lock(&dev->mode_config.mutex); >>>> + >>>> + BUG_ON(crtc->pending_flip == NULL); >>>> + >>>> + crtc->funcs->set_base(crtc, crtc->x, crtc->y, NULL); >>>> + >>>> + pending = crtc->pending_flip; >>>> + crtc->pending_flip = NULL; >>>> + >>>> + pending->frame = drm_vblank_count(dev, crtc->pipe); >>>> >>>> >>> >>> What's protecting dev->flip_list here? >>> >> >> That should be struct_mutex. As mentioned above, I missed it when I >> changed the mutex to mode_config.mutex. >> > > Hence the locking order question: whether the mode_config > mutex needs to be held _as well_ across the list insertion, since > crtc->pending_flip.link is modified at list insertion, really depends > on what's protecting that member. I missed a mutex around the dev->flip_list manipulation, I didn't confuse the locking order. ... >>>> -/** No-op.
*/ >>>> +ssize_t drm_read(struct file *filp, char __user *buffer, >>>> + size_t count, loff_t *offset) >>>> +{ >>>> + struct drm_file *file_priv = filp->private_data; >>>> + struct drm_device *dev = file_priv->minor->dev; >>>> + struct drm_pending_event *event; >>>> + ssize_t total, ret; >>>> + >>>> + ret = wait_event_interruptible(file_priv->event_wait, >>>> + >>>> !list_empty(&file_priv->event_list)); >>>> + if (ret < 0) >>>> + return ret; >>>> + >>>> + total = 0; >>>> >>>> >>> >>> What's protecting file_priv->event_list here? >>> >> >> You can test for list emptiness without taking the lock. >> > > Are you suggesting accessing a member of a mutex-protected struct without > taking the mutex? > Do you have a pointer to where what you say above is documented? > > That would be highly architecture-dependent and in this particular case > require that the processor be capable of atomic pointer reads and writes, > that there be no processor r/w reordering, and that the compiler assume the > list head is declared volatile. My drm_read() code below is broken in that it needs to recheck for list_empty again after taking the lock. Other than that, testing for list empty without taking the lock that normally protects the list is a common idiom in the kernel. It doesn't require write barriers or atomic ops, and of course, since it's done without the lock somebody could be changing the list while you're trying to establish whether it's empty. But the end result is no different from grabbing the lock, testing for empty, and then dropping the lock. Once you drop the lock, the result is potentially invalid anyway. So for something like poll implementations, where you just need to return whether there's data to read or not, it works just fine. If several processes/threads share the same fd, then the poll syscall is inherently racy, since between poll and read, some other process may have read the data.
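The idiom can be sketched in userspace terms: an advisory, unlocked emptiness test for the poll path, plus a mandatory re-check under the lock before anything is actually removed. The queue_* names and pthread locking are illustrative stand-ins, not kernel API:

```c
#include <pthread.h>
#include <stddef.h>

/* A minimal mutex-protected event queue.  Emptiness may be tested
 * without the lock -- the answer can be stale, but it is no less
 * valid than taking the lock, testing, and dropping it again. */
struct event { struct event *next; };

struct event_queue {
	pthread_mutex_t lock;
	struct event *head;
};

void queue_push(struct event_queue *q, struct event *e)
{
	pthread_mutex_lock(&q->lock);
	e->next = q->head;
	q->head = e;
	pthread_mutex_unlock(&q->lock);
}

/* The poll()-style check: advisory, lock-free. */
int queue_has_data(const struct event_queue *q)
{
	return q->head != NULL;
}

/* The read()-style consumer: the unlocked test is only a fast path;
 * emptiness is re-checked once the lock is held -- the re-check the
 * drm_read() quoted above still needs. */
struct event *queue_pop(struct event_queue *q)
{
	struct event *e = NULL;

	if (q->head == NULL)		/* fast path, no lock taken */
		return NULL;

	pthread_mutex_lock(&q->lock);
	if (q->head != NULL) {		/* may have emptied meanwhile */
		e = q->head;
		q->head = e->next;
	}
	pthread_mutex_unlock(&q->lock);
	return e;
}
```

Without the re-check under the lock, a second reader on the same fd could empty the queue between the unlocked test and the lock acquisition, which is exactly the race Thomas describes.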
It's also used like in drm_read below, for testing for an empty list before taking the lock, though that's more of a micro-optimization and requires that the list is tested for non-empty again once under the lock before you operate on the list. See for example drivers/acpi/event.c, infiniband/core/ucm.c, drivers/usb/core/devio.c, sound/core/control.c for cases where a list is altered under a lock, but tested for emptiness without the lock in the poll implementation. There are even more cases with handrolled linked lists or other data structures used in the same way - no lock necessary when testing for empty. > Even if that's the case let's assume list_empty() returns true, and then > another thread sneaks in and empties the queue before you take the mutex. > The next thing you do is to access and try to remove "event", which doesn't > exist because the list is empty. If you're lucky you'll see an oops. If not, > you end up with strange memory corruption. >> >>>> >>>> + while (!list_empty(&file_priv->event_list)) { >>>> + mutex_lock(&dev->struct_mutex); >>>> + event = list_first_entry(&file_priv->event_list, >>>> + struct drm_pending_event, link); >>>> + if (total + event->event->length > count) { >>>> + mutex_unlock(&dev->struct_mutex); >>>> + break; >>>> + } >>>> + list_del(&event->link); >>>> + mutex_unlock(&dev->struct_mutex); >>>> + >>>> + if (copy_to_user(buffer + total, >>>> + event->event, event->event->length)) { >>>> + total = -EFAULT; >>>> + break; >>>> + } >>>> + >>>> + total += event->event->length; >>>> + event->destroy(event); >>>> + } >>>> + >>>> + return total; >>>> +} >>>> +EXPORT_SYMBOL(drm_read); >>>> + >>>> unsigned int drm_poll(struct file *filp, struct poll_table_struct >>>> *wait) >>>> { >>>> - return 0; >>>> + struct drm_file *file_priv = filp->private_data; >>>> + unsigned int mask = 0; >>>> + >>>> + poll_wait(filp, &file_priv->event_wait, wait); >>>> + >>> >>> And here? >> >> Again, testing list emptiness. > > See above. 
Although here, the consequences probably aren't that fatal. > >>>> + if (!list_empty(&file_priv->event_list)) >>>> + mask |= POLLIN | POLLRDNORM; >>>> + >>>> + return mask; >>>> } ... >>>> +void drm_finish_pending_flip(struct drm_device *dev, >>>> + struct drm_pending_flip *f, u32 frame) >>>> +{ >>>> + struct timeval now; >>>> + >>>> + f->event.frame = frame; >>>> + do_gettimeofday(&now); >>>> + f->event.tv_sec = now.tv_sec; >>>> + f->event.tv_usec = now.tv_usec; >>>> + drm_vblank_put(dev, f->pipe); >>>> + list_del_init(&f->link); >>> >>> What's protecting the file_priv->event_list here? >> >> This function is always called with the struct_mutex held. > > No. It's not. It's called with the mode_config mutex held. No, it's protected by struct_mutex, but there's one call where I missed struct_mutex: the path in the page_flip ioctl where we're replacing a pending flip. I've added struct_mutex locking around the call. cheers, Kristian -- _______________________________________________ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel