This mail is getting out of control... too many sub-threads going on... maybe we should break it out and talk about events vs kernel scheduling, and hold off on the patch review until we've figured something out.
2009/8/28 Thomas Hellström <tho...@shipmail.org>: > Kristian Høgsberg skrev: >> >> 2009/8/27 Thomas Hellström <tho...@shipmail.org>: >> >>> >>> Kristian Høgsberg skrev: >>> >>>> >>>> --- >>>> >>>> Okay, here's the patch again with the locking fixed. Everything else is >>>> the same. As for the potential problem with polling on the drm fd with >>>> both dri1 and dri2 enabled, I suggest that we just disable the dri1 >>>> sigio handler in the xserver. We'll do this in a commit that precedes >>>> the >>>> merge of the page flipping code, so that no server will be able to poll >>>> on the drm fd in both dri1 and dri2 code. >>>> >>>> cheers, >>>> Kristian >>>> >>>> >>> >>> <Grr> >>> >>> 1) The locking problems are still not fixed. See below. I just did a >>> quick look, so the comment list is not complete. >>> >> >> It's not as bad as you think. There is one glitch where I changed >> struct_mutex to mode_config.mutex and left dev->flip_list unprotected. >> The other issues you comment on should be fine, I've followed up in >> detail below. >> >> > > I actually think it is. See below. > >> >> >>> >>> </Grr> >>> >>> Even if that is fixed, I believe this patch adds a framework that >>> doesn't belong in a modern-day graphical subsystem for the following >>> reasons, many of them already mentioned, but let me reiterate. Although >>> the dri1 compatibility issue will be fixed with your suggestion. >>> >>> a) It puts requirements on user-space for correct functionality of DRM >>> ioctls. I'm talking about the master needing to parse the event queue. >>> >> >> Don't we already have a lot of requirements on userspace to use the >> DRM? It's not exactly trivial to prepare an execbuffer ioctl, or to >> find a connector, crtc and mode and set that, for example. Part of >> this work is a new libdrm entry point to read and parse the events from >> the fd. See the dri2-swapbuffer branch in drm.git.
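For illustration, the event stream on the fd could be consumed with a loop along these lines. The two-field header mirrors the generic struct drm_event this patch series adds to include/drm/drm.h (length covers the whole event, header included); parse_drm_events itself is a hypothetical helper, not existing libdrm API:

```c
#include <stddef.h>
#include <stdint.h>

/* Generic event header, as in the struct drm_event this patch adds
 * to include/drm/drm.h: 'length' is the size of the whole event,
 * header included, so differently-sized events share one stream. */
struct drm_event {
	uint32_t type;
	uint32_t length;
};

/* Walk a buffer returned by read(2) on the drm fd and hand each
 * complete event to 'handler'.  Returns the number of events seen.
 * Hypothetical helper for illustration, not libdrm API. */
int parse_drm_events(const void *buf, size_t len,
		     void (*handler)(const struct drm_event *))
{
	const char *p = buf;
	size_t i = 0;
	int count = 0;

	while (i + sizeof(struct drm_event) <= len) {
		const struct drm_event *ev =
			(const struct drm_event *)(p + i);

		if (ev->length < sizeof *ev || i + ev->length > len)
			break;		/* truncated or malformed event */
		handler(ev);
		i += ev->length;
		count++;
	}
	return count;
}
```

A master would call this on the buffer it gets from read(2) and dispatch on ev->type, e.g. to pick out flip-completion events.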
>> >> > > >>> b) It requires the master to act as a scheduler, and circumvents the DRM >>> command submission mechanism through the delayed unpin callback. If this >>> is to workaround any inability of GEM to serve a command submission >>> while a previous command submission is blocked in the kernel, then IMHO >>> that should be fixed and not worked around. >>> >> >> It's not about workarounds. Your suggestion *blocks the hw* while >> waiting for vsync. > > No it doesn't. It blocks dri clients when they try to render to old fronts. > Other dri clients would continue rendering. It provides a natural migration > path to triple buffering where automagically nothing is blocked, and also to > advanced software schedulers that can buffer command submissions instead of > blocking. You advocated blocking the command queue using a wait-for-vblank command at some point. > Now the concern about GEM was that if the kernel takes a global mutex > _before_ blocking a client and doesn't release that mutex when the client is > blocked, all rendering will naturally be blocked as a consequence. That's not how the patch works. We hold no locks while the client is blocked. > What my suggestion *does* is to block the X server if AIGLX tries to render > to the old front, and the command scheduler is simple. Now I'm not > completely sure how serious that is, given that AIGLX will, as stated > before, frequently block the X server anyway since it's not running in a > separate thread. Are you talking about an AIGLX client submitting an expensive shader locking up the X server for a long time? I'm not aware of any way AIGLX can block the server for up to 20ms at a time right now, but that is what it'll do if it blocks on vsync. >>> c) The implementation will mostly be worked around with capable >>> schedulers and hardware. >>> >> >> This is not about the capability of the hw or the sw. The design is >> deliberate. 
>> > > What I mean is that if I don't want the kernel code to do delayed unpins, > because my cs already handles that, and I don't want the X server to block > clients because my cs or hardware will handle that, I would do my very best > to work around this code. I don't see the difference between a delayed unpin and a fence. We're doing the same thing; why is it ok if we call it a fence? You were saying that we should make the vsync interrupt look like a sw command queue and use that to fence the scan-out buffer, right? Is that really better? I feel a bit like we're getting dragged back into the GEM vs TTM discussion here. I have no stake in that battle, I'm just trying to work with what's there. I don't think there's a fundamental problem with, or benefit to, either, and what you can do with a TTM fence you can do by waiting on the right GEM BO, as far as I understand. >>> A couple of questions: >>> Why are you guys so reluctant to use kernel scheduling for this instead >>> of a mix of kernel / user-space scheduling? >>> >> >> I'm not opposed to kernel scheduling as well, but userspace needs this >> event. I've made the case for AIGLX in the X server already, and >> direct rendering clients (for example, compositors) that want to >> handle input for as long as possible and avoid blocking have the same >> needs. >> > > I've still failed to understand this. Let's _assume_ for a while that the > kernel handles scheduling perfectly in a non-blocking fashion, or let's > assume we have triple-buffering, or let's assume we have a multi-FIFO card > that can do pageflipping and vblank barriers on all FIFOs. > > Why then exactly are events needed? And why are we required to track the > progress of the command fifo with events like jbarnes suggests, and finally > why is this mechanism not needed in the non-pageflipping case? If you can > give a typical use-case that would probably help a lot and avoid confusion. I think we have different applications in mind.
My guess is that you're thinking of a typical game workload that tries to render as many frames per second as possible. What I have in mind is an application like a compositor or other gui-type application, one that doesn't use the hw much and doesn't spend much time rendering but needs to respond to input events and requests from clients. What I'd like to know is, how do you design the main loop of that application so that it doesn't spend most of its time blocked in some gl entry point waiting for the swap to finish? It's not an application that's looking to queue up as many frames ahead as possible; that only introduces lag between the input events and what's on the screen. I'm not looking for "use threads", and I don't think there currently is a way to do this using just the OpenGL/GLX APIs. I agree that, aside from the case of AIGLX blocking the server, we don't need the event. But what I'd like to do is to add a GLX extension that lets applications add a file descriptor to their main loop and that way discover when the flip is done. That lets them stay in phase with the vsync, and provides a way to avoid spending most of their time blocked in an ioctl. And as I've said before, I'm not opposed to doing the scheduling in the kernel, as long as we also get the event, so applications have a chance of knowing when they might block. >>> If the plan is to eliminate DRI2GetBuffers() once per frame, what will >>> then be used to block clients rendering to the old back buffer? >> >> There'll be an event that's sent back after each DRI2SwapBuffer and >> the clients will block on receiving that event. We still need to send >> a request to the xserver and receive confirmation that the xserver has >> received it before we can render again. > > The above is to make sure the swap is scheduled before any continued > rendering, right? Yes. >> DRI2GetBuffers is a request >> that expects a reply and will block the client on the xserver when we >> call it.
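A minimal sketch of the kind of main loop Kristian describes — polling the drm (or GLX extension) fd next to the input fd so the application never blocks inside GL, and only repainting once the previous flip has completed. Everything here is illustrative: dispatch_once, handle_input, read_flip_event and repaint are hypothetical names, not drm or GLX API:

```c
#include <poll.h>
#include <stdbool.h>

/* Per-output state; names are illustrative, not drm API. */
struct loop_state {
	bool flip_pending;	/* a flip is scheduled, old buffer busy */
	bool damage;		/* something changed, a repaint is wanted */
};

/* One iteration of a compositor-style main loop: poll the flip-event
 * fd next to the input fd, stay responsive to input while a flip is
 * in flight, and only render a new frame once the previous flip's
 * completion event has arrived.  The three callbacks stand in for
 * real input handling, event reading, and render-plus-flip code. */
int dispatch_once(int input_fd, int drm_fd, struct loop_state *s,
		  void (*handle_input)(int),
		  void (*read_flip_event)(int),
		  void (*repaint)(void))
{
	struct pollfd fds[2] = {
		{ .fd = input_fd, .events = POLLIN },
		{ .fd = drm_fd,   .events = POLLIN },
	};

	if (poll(fds, 2, -1) < 0)
		return -1;

	if (fds[0].revents & POLLIN) {
		handle_input(input_fd);	/* never blocked on the swap */
		s->damage = true;
	}
	if (fds[1].revents & POLLIN) {
		read_flip_event(drm_fd);	/* flip done; buffer reusable */
		s->flip_pending = false;
	}
	if (s->damage && !s->flip_pending) {
		repaint();		/* draw and schedule the next flip */
		s->flip_pending = true;
		s->damage = false;
	}
	return 0;
}
```

The point of the structure is that input keeps being handled while flip_pending is set, and rendering stays in phase with vsync without ever sleeping in an ioctl.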
DRI2SwapBuffers is an async request, i.e. there's no reply and >> calling it won't necessarily block the client. We still have to wait >> for the new event before we can go on rendering, but doing it this way >> makes the client and server less tightly coupled. We may end up doing >> the roundtrip between client and server at a point where the client >> was going to block anyway (like disk i/o or something) saving a >> context switch. >> >> > > Hmm. I don't understand fully. So up to now, my picture of how a frame was > rendered looks like this. > > swapBuffers(); > if (check_for_needed_getbuffers()) > getbuffers(); > render(); > swapBuffers(); > > This is one X call per frame in the steady-state case. Now, where do you add > the dri2 pageflip throttling if we don't need to call getbuffers()? Is it in > check_for_needed_getbuffers()? I described this in more detail and hopefully more coherently in my email to Michel. If that's still not clear, follow up there. >>>> drivers/gpu/drm/drm_crtc.c | 171 >>>> ++++++++++++++++++++++++++++++- >>>> drivers/gpu/drm/drm_crtc_helper.c | 10 ++ >>>> drivers/gpu/drm/drm_drv.c | 1 + >>>> drivers/gpu/drm/drm_fops.c | 68 ++++++++++++- >>>> drivers/gpu/drm/drm_irq.c | 42 ++++++++ >>>> drivers/gpu/drm/i915/i915_drv.c | 1 + >>>> drivers/gpu/drm/i915/intel_display.c | 16 +++- >>>> drivers/gpu/drm/radeon/radeon_display.c | 3 +- >>>> include/drm/drm.h | 25 +++++ >>>> include/drm/drmP.h | 32 ++++++ >>>> include/drm/drm_crtc.h | 27 +++++ >>>> include/drm/drm_crtc_helper.h | 4 + >>>> include/drm/drm_mode.h | 16 +++ >>>> 13 files changed, 409 insertions(+), 7 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/drm_crtc.c b/drivers/gpu/drm/drm_crtc.c >>>> index 8fab789..d877c21 100644 >>>> --- a/drivers/gpu/drm/drm_crtc.c >>>> +++ b/drivers/gpu/drm/drm_crtc.c >>>> @@ -34,6 +34,8 @@ >>>> #include "drmP.h" >>>> #include "drm_crtc.h" >>>> >>>> +#undef set_base >>>> + >>>> struct drm_prop_enum_list { >>>> int type; >>>> char *name; >>>> @@ -342,6
+344,35 @@ void drm_framebuffer_cleanup(struct drm_framebuffer >>>> *fb) >>>> EXPORT_SYMBOL(drm_framebuffer_cleanup); >>>> >>>> /** >>>> + * drm_crtc_async_flip - do a set_base call from a work queue >>>> + * @work: work struct >>>> + * >>>> + * Called when a set_base call is queued by the page flip code. This >>>> + * allows the flip ioctl itself to return immediately and allow >>>> userspace >>>> + * to continue working. >>>> + */ >>>> +static void drm_crtc_async_flip(struct work_struct *work) >>>> +{ >>>> + struct drm_crtc *crtc = container_of(work, struct drm_crtc, >>>> async_flip); >>>> + struct drm_device *dev = crtc->dev; >>>> + struct drm_pending_flip *pending; >>>> + >>>> + mutex_lock(&dev->mode_config.mutex); >>>> + >>>> + BUG_ON(crtc->pending_flip == NULL); >>>> + >>>> + crtc->funcs->set_base(crtc, crtc->x, crtc->y, NULL); >>>> + >>>> + pending = crtc->pending_flip; >>>> + crtc->pending_flip = NULL; >>>> + >>>> + pending->frame = drm_vblank_count(dev, crtc->pipe); >>>> >>>> >>> >>> What's protecting dev->flip_list here? >>> >> >> That should be struct_mutex. As mentioned above, I missed it when I >> changed the mutex to mode_config.mutex. >> > > Hence the locking order question: whether the mode_config > mutex needs to be held _as well_ across the list insertion, since > crtc->pending_flip.link is modified at list insertion, really depends > on what's protecting that member. I missed a mutex around the dev->flip_list manipulation, I didn't confuse the locking order. ... >>>> -/** No-op.
*/ >>>> +ssize_t drm_read(struct file *filp, char __user *buffer, >>>> + size_t count, loff_t *offset) >>>> +{ >>>> + struct drm_file *file_priv = filp->private_data; >>>> + struct drm_device *dev = file_priv->minor->dev; >>>> + struct drm_pending_event *event; >>>> + ssize_t total, ret; >>>> + >>>> + ret = wait_event_interruptible(file_priv->event_wait, >>>> + >>>> !list_empty(&file_priv->event_list)); >>>> + if (ret < 0) >>>> + return ret; >>>> + >>>> + total = 0; >>>> >>>> >>> >>> What's protecting file_priv->event_list here? >>> >> >> You can test for list emptiness without taking the lock. >> > > Are you suggesting accessing a member of a mutex-protected struct without > taking the mutex? > Do you have a pointer to where what you say above is documented? > > That would be highly architecture-dependent and in this particular case > require that the processor be capable of atomic pointer reads and writes, > that there be no processor r/w reordering, and that the compiler assume the > list head is declared volatile. My drm_read() code below is broken in that it needs to recheck for list_empty again after taking the lock. Other than that, testing for list empty without taking the lock that normally protects the list is a common idiom in the kernel. It doesn't require write barriers or atomic ops, and of course, since it's done without the lock somebody could be changing the list while you're trying to establish whether it's empty. But the end result is no different from grabbing the lock, testing for empty, and then dropping the lock. Once you drop the lock, the result is potentially invalid anyway. So for something like poll implementations, where you just need to return whether there's data to read or not, it works just fine. If several processes/threads share the same fd, then the poll syscall is inherently racy, since between poll and read, some other process may have read the data.
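The idiom can be sketched in userspace terms: an advisory, unlocked emptiness test for the poll path, plus a mandatory re-check under the lock before anything is actually removed. The queue_* names and pthread locking are illustrative stand-ins, not kernel API:

```c
#include <pthread.h>
#include <stddef.h>

/* A minimal mutex-protected event queue.  Emptiness may be tested
 * without the lock -- the answer can be stale, but it is no less
 * valid than taking the lock, testing, and dropping it again. */
struct event { struct event *next; };

struct event_queue {
	pthread_mutex_t lock;
	struct event *head;
};

void queue_push(struct event_queue *q, struct event *e)
{
	pthread_mutex_lock(&q->lock);
	e->next = q->head;
	q->head = e;
	pthread_mutex_unlock(&q->lock);
}

/* The poll()-style check: advisory, lock-free. */
int queue_has_data(const struct event_queue *q)
{
	return q->head != NULL;
}

/* The read()-style consumer: the unlocked test is only a fast path;
 * emptiness is re-checked once the lock is held -- the re-check the
 * drm_read() quoted above still needs. */
struct event *queue_pop(struct event_queue *q)
{
	struct event *e = NULL;

	if (q->head == NULL)		/* fast path, no lock taken */
		return NULL;

	pthread_mutex_lock(&q->lock);
	if (q->head != NULL) {		/* may have emptied meanwhile */
		e = q->head;
		q->head = e->next;
	}
	pthread_mutex_unlock(&q->lock);
	return e;
}
```

Without the re-check under the lock, a second reader on the same fd could empty the queue between the unlocked test and the lock acquisition, which is exactly the race Thomas describes.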
It's also used like in drm_read below, for testing for an empty list before taking the lock, though that's more of a micro-optimization and requires that the list is tested for non-empty again once under the lock before you operate on the list. See for example drivers/acpi/event.c, infiniband/core/ucm.c, drivers/usb/core/devio.c, sound/core/control.c for cases where a list is altered under a lock, but tested for emptiness without the lock in the poll implementation. There are even more cases with handrolled linked lists or other data structures used in the same way - no lock necessary when testing for empty. > Even if that's the case let's assume list_empty() returns true, and then > another thread sneaks in and empties the queue before you take the mutex. > The next thing you do is to access and try to remove "event", which doesn't > exist because the list is empty. If you're lucky you'll see an oops. If not, > you end up with strange memory corruption. >> >>>> >>>> + while (!list_empty(&file_priv->event_list)) { >>>> + mutex_lock(&dev->struct_mutex); >>>> + event = list_first_entry(&file_priv->event_list, >>>> + struct drm_pending_event, link); >>>> + if (total + event->event->length > count) { >>>> + mutex_unlock(&dev->struct_mutex); >>>> + break; >>>> + } >>>> + list_del(&event->link); >>>> + mutex_unlock(&dev->struct_mutex); >>>> + >>>> + if (copy_to_user(buffer + total, >>>> + event->event, event->event->length)) { >>>> + total = -EFAULT; >>>> + break; >>>> + } >>>> + >>>> + total += event->event->length; >>>> + event->destroy(event); >>>> + } >>>> + >>>> + return total; >>>> +} >>>> +EXPORT_SYMBOL(drm_read); >>>> + >>>> unsigned int drm_poll(struct file *filp, struct poll_table_struct >>>> *wait) >>>> { >>>> - return 0; >>>> + struct drm_file *file_priv = filp->private_data; >>>> + unsigned int mask = 0; >>>> + >>>> + poll_wait(filp, &file_priv->event_wait, wait); >>>> + >>> >>> And here? >> >> Again, testing list emptiness. > > See above. 
Although here, the consequences probably aren't that fatal. > >>>> + if (!list_empty(&file_priv->event_list)) >>>> + mask |= POLLIN | POLLRDNORM; >>>> + >>>> + return mask; >>>> } ... >>>> +void drm_finish_pending_flip(struct drm_device *dev, >>>> + struct drm_pending_flip *f, u32 frame) >>>> +{ >>>> + struct timeval now; >>>> + >>>> + f->event.frame = frame; >>>> + do_gettimeofday(&now); >>>> + f->event.tv_sec = now.tv_sec; >>>> + f->event.tv_usec = now.tv_usec; >>>> + drm_vblank_put(dev, f->pipe); >>>> + list_del_init(&f->link); >>> >>> What's protecting the file_priv->event_list here? >> >> This function is always called with the struct_mutex held. > > No. It's not. It's called with the mode_config mutex held. No, it's protected by struct_mutex, but there's one call where I missed struct_mutex: the path in the page_flip ioctl where we're replacing a pending flip. I've added struct_mutex locking around the call. cheers, Kristian -- _______________________________________________ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel