Re: [Bug #13819] system freeze when switching to console

Linus Torvalds Tue, 08 Sep 2009 11:09:04 -0700


On Tue, 8 Sep 2009, reinette chatre wrote:
> 
> As you can see from the kernel version it is not a build of a vanilla
> kernel. It only contains changes related to the wireless networking work
> I am doing.
> 
> Here is the output:


Thanks, this is great. It pinpoints the problem very effectively.

> [  352.803960] BUG: unable to handle kernel NULL pointer dereference at 
> 0000000000000084
> [  352.804006] IP: [<ffffffffa03ecaab>] i915_driver_irq_handler+0x26b/0xd20 
> [i915]

The code here is

          16:   48 8b 80 00 01 00 00    mov    0x100(%rax),%rax
          1d:   48 8b 50 08             mov    0x8(%rax),%rdx
          21:   48 85 d2                test   %rdx,%rdx
          24:   74 11                   je     0x37
          26:   49 8b 44 24 78          mov    0x78(%r12),%rax
          2b:*  8b 80 84 00 00 00       mov    0x84(%rax),%eax     <-- trapping 
instruction
          31:   89 82 08 08 00 00       mov    %eax,0x808(%rdx)
          37:   f6 45 a0 02             testb  $0x2,-0x60(%rbp)

and that "testb $0x2, -0x60(%rbp)" seems to be the

        if (iir & I915_USER_INTERRUPT) {

test if I'm reading things right. Although it could also be the

        if (eir & I915_ERROR_MEMORY_REFRESH) {

thing. The disassembly is totally impossible to read, because the stupid 
i915 driver is chock-full of crap like

        if (IS_G4X(dev)) {
                ..

which expands to insane amounts of code that check the PCI ID's one by 
one.

Intel guys: could you _please_ stop doing that. Create a capability mask 
in the device or something, so that you can test for "is this a G4x" with 
a single bit test, rather than have code like this:

        mov    0x31c(%rsi),%eax
        cmp    $0x2982,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2972,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2992,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x29a2,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2a02,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2a12,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2a42,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2e02,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2e12,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2e22,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x2e32,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>
        cmp    $0x42,%eax
        je     0xffffffff8124b669 <i915_driver_irq_handler+177>

for that IS_G4X() thing (I'm not kidding - that's exactly a hundred bytes 
of code for that _stupid_ test, and it's inlined!)

Anyway, we're getting that DRM irq, and it has a normal IRQ stack trace:

> [  352.804006] Process Xorg (pid: 4424, threadinfo ffff8800b6b1a000, task 
> ffff880037373c00)
> [  352.804006] Call Trace:
> [  352.804006]  <IRQ> 
> [  352.804006]  [<ffffffff8106db7d>] ? mark_held_locks+0x6d/0x90
> [  352.804006]  [<ffffffff81098ee8>] handle_IRQ_event+0x68/0x170
> [  352.804006]  [<ffffffff8109ac01>] handle_edge_irq+0xc1/0x160
> [  352.804006]  [<ffffffff8100e76f>] handle_irq+0x1f/0x30
> [  352.804006]  [<ffffffff8100dc6a>] do_IRQ+0x6a/0xf0
> [  352.804006]  [<ffffffff8100c793>] ret_from_intr+0x0/0xf

.. but it happened just as we're tearing down the DRM irq handling:

> [  352.804006]  <EOI> 
> [  352.804006]  [<ffffffff81070b88>] ? lock_acquire+0xe8/0x100
> [  352.804006]  [<ffffffffa03c0b85>] ? drm_irq_uninstall+0x65/0x180 [drm]
> [  352.804006]  [<ffffffff8132d7b5>] ? mutex_lock_nested+0x45/0x320
> [  352.804006]  [<ffffffffa03c0b85>] ? drm_irq_uninstall+0x65/0x180 [drm]
> [  352.804006]  [<ffffffff8106de85>] ? trace_hardirqs_on_caller+0x145/0x190
> [  352.804006]  [<ffffffff8106dedd>] ? trace_hardirqs_on+0xd/0x10
> [  352.804006]  [<ffffffffa03c0b85>] ? drm_irq_uninstall+0x65/0x180 [drm]
> [  352.804006]  [<ffffffffa03f3335>] ? i915_gem_idle+0x225/0x330 [i915]
> [  352.804006]  [<ffffffffa03f34c7>] ? i915_gem_leavevt_ioctl+0x37/0x50 [i915]
> [  352.804006]  [<ffffffffa03bdafd>] ? drm_ioctl+0x17d/0x3c0 [drm]
> [  352.804006]  [<ffffffffa03f3490>] ? i915_gem_leavevt_ioctl+0x0/0x50 [i915]

so what is going on is that the i915 driver has obviously torn down some 
state before it uninstalls the irq, so the irq happens when the state has 
already been torn down, and the irq handler is not ready for that.

This patch *may* fix it - simply by getting rid of the irq early. However, 
I did not check whether maybe something in i915_gem_idle() actually needs 
the interrupt to be able to happen, so this is TOTALLY UNTESTED!

                Linus
---
 drivers/gpu/drm/i915/i915_gem.c |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 7edb5b9..80e5ba4 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4232,15 +4232,11 @@ int
 i915_gem_leavevt_ioctl(struct drm_device *dev, void *data,
                       struct drm_file *file_priv)
 {
-       int ret;
-
        if (drm_core_check_feature(dev, DRIVER_MODESET))
                return 0;
 
-       ret = i915_gem_idle(dev);
        drm_irq_uninstall(dev);
-
-       return ret;
+       return i915_gem_idle(dev);
 }
 
 void
--
To unsubscribe from this list: send the line "unsubscribe kernel-testers" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bug #13819] system freeze when switching to console

Reply via email to