i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

2012-04-10 Thread Daniel Vetter
On Fri, Apr 6, 2012 at 23:31, Jiri Slaby  wrote:
>> That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close
>> a race where the pipestat triggered an interrupt after we processed the
>> secondary registers and before resetting the primary.
>>
>> But the basic premise that we should only enter the interrupt handler
>> with IIR!=0 holds (presuming non-shared interrupt lines such as MSI).
>
> Ok, this behavior is definitely new. I get several "nobody cared" about
> this interrupt a week. This never used to happen. And something weird
> emerges in /proc/interrupts when this happens:
>  42:    1003292    1212890   PCI-MSI-edge      �s::00:02.0
> instead of
>  42:    1006715    1218472   PCI-MSI-edge      i915@pci::00:02.0

This looks ugly. Can you try to reproduce on 3.4-rc2? That should
contain everything that -next currently contains drm/i915-wise. If it
still happens there, please bisect it.

Also please check whether any of the subordinate interrupt regs
(pipestat) is stuck and might cause these interrupts as Jesse
suggested.

Thanks, Daniel
-- 
Daniel Vetter
daniel.vetter at ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch
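
The decision discussed in the quoted text above (IIR as the master register, with pipestat still inspected when IIR is 0) can be sketched as a small userspace model in C. This is only an illustration, not the i915 handler: `model_regs`, `model_irq_handler`, the two-pipe layout and `MODEL_PIPESTAT_STATUS_MASK` are assumptions made up for the example. The point it shows is that when neither IIR nor any pipestat status bit is set, the handler has nothing to claim and reports the interrupt as unhandled, which is the condition the kernel eventually complains about with "nobody cared".

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Assumption for this sketch: the low 16 bits of a pipestat value are
 * status bits and the high 16 bits are enable bits. */
#define MODEL_PIPESTAT_STATUS_MASK 0x0000ffffu
#define MODEL_NUM_PIPES 2

/* Snapshot of the registers the handler would read (model only). */
struct model_regs {
	uint32_t iir;                       /* master interrupt register */
	uint32_t pipestat[MODEL_NUM_PIPES]; /* chained secondary status */
};

/* Decide whether this interrupt belongs to the device.
 * Pipestat is checked even when IIR is 0, mirroring the race fix
 * described in the thread; only if both are quiet is it unhandled. */
static enum model_irqreturn model_irq_handler(const struct model_regs *regs)
{
	int pipe;
	int pipe_event = 0;

	for (pipe = 0; pipe < MODEL_NUM_PIPES; pipe++) {
		if (regs->pipestat[pipe] & MODEL_PIPESTAT_STATUS_MASK)
			pipe_event = 1;
	}

	if (regs->iir == 0 && !pipe_event)
		return MODEL_IRQ_NONE;

	return MODEL_IRQ_HANDLED;
}
```

Feeding the model a snapshot with IIR == 0 and a status bit stuck in one pipestat entry returns `MODEL_IRQ_HANDLED`, which is why a stuck subordinate register is worth ruling out before blaming anything else.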


i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

2012-04-10 Thread Jiri Slaby
On 04/07/2012 12:40 AM, Thomas Gleixner wrote:
> On Fri, 6 Apr 2012, Jiri Slaby wrote:
>> It very looks like the generic IRQ handling code is broken. Like it
>> frees/corrupts irq_desc and ...
> 
> OMG, your problem analyzing skills are amazing.

Hehe, no I did *no* analysis. I stand here as a bug reporter.

> What the heck makes you assume that the irq core code is broken?  Core
> code, which works on a gazillion of machines and different device
> drivers and does not corrupt anything except that i915 thingy?

Note that this is a -next regression. And i915 graphics used. This
definitely doesn't run on a gazillion of machines.

> If you're still convinced that the irq core is messing with your
> device string,

Nope, thanks for the input.

-- 
js
suse labs




i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

2012-04-09 Thread Dave Airlie
>> You know what? suspend calls free_irq() via i915_drm_freeze() ->
>> drm_irq_uninstall() and the resume code calls request_irq() again.
>> free_irq() removes the action and request_irq installs it fresh.
>
> Yeah this is a known issue with the DRM code, I thought Dave had a
> fix queued a long time ago though...  Dave?

/me doesn't remember seeing one but maybe this one?

http://lists.freedesktop.org/archives/dri-devel/2011-August/013407.html

probably fell down a hole.

Dave.


i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

2012-04-09 Thread Jesse Barnes
On Sat, 7 Apr 2012 00:40:28 +0200 (CEST)
Thomas Gleixner  wrote:
> You know what? suspend calls free_irq() via i915_drm_freeze() ->
> drm_irq_uninstall() and the resume code calls request_irq() again.
> free_irq() removes the action and request_irq installs it fresh.

Yeah this is a known issue with the DRM code, I thought Dave had a
fix queued a long time ago though...  Dave?

-- 
Jesse Barnes, Intel Open Source Technology Center


i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

2012-04-07 Thread Thomas Gleixner
On Fri, 6 Apr 2012, Jiri Slaby wrote:

> On 03/30/2012 02:24 PM, Chris Wilson wrote:
> > On Fri, 30 Mar 2012 14:11:47 +0200, Jiri Slaby  wrote:
> >> On 03/30/2012 12:45 PM, Chris Wilson wrote:
> >>> On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby  wrote:
> >>>> I don't know what to dump more, because iir is obviously zero too. What
> >>>> other sources of interrupts are on the (G33) chip?
> >>>
> >>> IIR is the master interrupt, with chained secondary interrupt statuses.
> >>> If IIR is 0, the interrupt wasn't raised by the GPU.
> >>
> >> This does not make sense, the handler does something different. Even if
> >> IIR is 0, it still takes a look at pipe stats.
> > 
> > That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close
> > a race where the pipestat triggered an interrupt after we processed the
> > secondary registers and before resetting the primary.
> > 
> > But the basic premise that we should only enter the interrupt handler
> > with IIR!=0 holds (presuming non-shared interrupt lines such as MSI).
> 
> Ok, this behavior is definitely new. I get several "nobody cared" about
> this interrupt a week. This never used to happen. And something weird
> emerges in /proc/interrupts when this happens:
>  42:    1003292    1212890   PCI-MSI-edge      �s::00:02.0
> instead of
>  42:    1006715    1218472   PCI-MSI-edge      i915@pci::00:02.0
> 
> It very looks like the generic IRQ handling code is broken. Like it
> frees/corrupts irq_desc and ...

OMG, your problem analyzing skills are amazing.

If irq_desc would have been freed, then it wouldn't print the numbers
and the irq type. And irq_desc is not corrupted either, otherwise the
whole thing would explode in your face.

The printout of the name is done via action->name. The irq action
merely holds a pointer to the device name string, which is handed over
with request_irq. So you are saying that the core code corrupts the
memory which was handed in via a pointer by the driver?

So now that's really an amazing core feature:

It corrupts the memory with weird characters and still maintains the
PCI bus number correct. So it not only corrupts memory it also moves
the PCI part of the string a few characters to the end.

If the pointer in the irq action would have been corrupted, then you
would see a few weird characters and then the full string, not a
random thing which is half correct and shifted by a few bytes.

The pointer which is handed in is dev->devname, which gets allocated
and filled in drm_pci_set_busid().

> ... then as well calls random handlers.

Which random handlers would be called? The core code only calls
handlers which are associated with a particular interrupt. And only
when that particular interrupt is raised and not because the CPU pulls
interrupt events out of thin air.

And it calls the stupid i915 handler and not something else, otherwise
you would not observe the IIR=0 printk or whatever you put there for
debugging.

> Suspend/resume cycle helps in this case and "i915@pci::00:02.0" is
> back in /proc/interrupts as can be seen above.

That's proving what? That the irq core code magically restores the
correct string, right? And probably it stops calling random handlers
as well. Brilliant deduction.

You know what? suspend calls free_irq() via i915_drm_freeze() ->
drm_irq_uninstall() and the resume code calls request_irq() again.
free_irq() removes the action and request_irq installs it fresh.

So now the interesting part is that free_irq() checks the dev_id
cookie for a match, which is also stored in the irq action. So we are
dealing with a magic corrupt only action->name and action->handler
problem. Pretty realistic.

What the heck makes you assume that the irq core code is broken?  Core
code, which works on a gazillion of machines and different device
drivers and does not corrupt anything except that i915 thingy?

Come on, you need to provide better evidence than weird ass guessing.

If you're still convinced that the irq core is messing with your
device string, then simply hand in a NULL pointer when requesting the
interrupt. That will make the core code explode nicely when it tries
to modify that memory.

Thanks,

tglx
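
The argument above rests on two points: the irq action only borrows the name pointer handed to request_irq(), and free_irq() matches on the dev_id cookie. A minimal userspace model in C illustrates both. Everything prefixed `model_` is hypothetical and written for this sketch; it is not the kernel's `struct irqaction`, request_irq() or free_irq(), only an imitation of the behaviour described in the message.

```c
#include <stdio.h>
#include <string.h>

/* Userspace model only: a stand-in for the parts of an irq action
 * that matter to this discussion. */
struct model_irqaction {
	const char *name; /* borrowed pointer, the string is never copied */
	void *dev_id;     /* cookie used to find the action again on free */
};

/* Install an action: stores the caller's pointer exactly as given. */
static void model_request_irq(struct model_irqaction *action,
			      const char *name, void *dev_id)
{
	action->name = name;
	action->dev_id = dev_id;
}

/* Remove an action, but only when the dev_id cookie matches.
 * Returns 1 if the action was removed, 0 otherwise. */
static int model_free_irq(struct model_irqaction *action, void *dev_id)
{
	if (action->dev_id != dev_id)
		return 0;
	action->name = NULL;
	action->dev_id = NULL;
	return 1;
}

/* Format the name column the way a /proc/interrupts-style listing
 * would: by following the stored pointer at the time of printing. */
static void model_show_name(const struct model_irqaction *action,
			    char *out, size_t len)
{
	snprintf(out, len, "%s", action->name ? action->name : "(none)");
}
```

In this model, overwriting the first bytes of the caller's name buffer after model_request_irq() changes what model_show_name() prints while the tail of the string stays intact, and a model_free_irq() followed by a fresh model_request_irq() makes the listing follow whatever the buffer holds at that point. That matches the observed pattern of a partly mangled name that a suspend/resume cycle restores, without the core having touched the string.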


i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

2012-04-07 Thread Jiri Slaby
On 03/30/2012 02:24 PM, Chris Wilson wrote:
> On Fri, 30 Mar 2012 14:11:47 +0200, Jiri Slaby  wrote:
>> On 03/30/2012 12:45 PM, Chris Wilson wrote:
>>> On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby  wrote:
>>>> I don't know what to dump more, because iir is obviously zero too. What
>>>> other sources of interrupts are on the (G33) chip?
>>>
>>> IIR is the master interrupt, with chained secondary interrupt statuses.
>>> If IIR is 0, the interrupt wasn't raised by the GPU.
>>
>> This does not make sense, the handler does something different. Even if
>> IIR is 0, it still takes a look at pipe stats.
> 
> That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close
> a race where the pipestat triggered an interrupt after we processed the
> secondary registers and before resetting the primary.
> 
> But the basic premise that we should only enter the interrupt handler
> with IIR!=0 holds (presuming non-shared interrupt lines such as MSI).

Ok, this behavior is definitely new. I get several "nobody cared" about
this interrupt a week. This never used to happen. And something weird
emerges in /proc/interrupts when this happens:
 42:    1003292    1212890   PCI-MSI-edge      �s::00:02.0
instead of
 42:    1006715    1218472   PCI-MSI-edge      i915@pci::00:02.0

It very looks like the generic IRQ handling code is broken. Like it
frees/corrupts irq_desc and then as well calls random handlers.

Suspend/resume cycle helps in this case and "i915@pci::00:02.0" is
back in /proc/interrupts as can be seen above.

Running 3.3.0-next-20120326_64+ now.

thanks,
-- 
js
suse labs


