ACPI, device interrupts, and suspend states

[Posted August 3, 2005 by corbet]

The 2.6.13-rc5 prepatch brought with it the reversal of a couple of ACPI-related patches. A look at what happened is rewarding in that it shows how hard it can be to get some things right, and how the kernel development model tries to address these issues.

Earlier 2.6.13 prepatches included a change to the core ACPI system. Whenever the system (or a part of it) was being suspended, the modified ACPI code would disable the interrupt link that routes device interrupts to the processor. This change is part of a new set of rules that expects every device to release its interrupt line on suspend and to reacquire it on resume. There are a few reasons for wanting to do things this way:

In theory, at least, a device could be resumed to find that its interrupt number has changed. People who reconfigure their hardware while the system is suspended (as opposed to being truly shut down) might be seen as actively looking for trouble, but it still might be nice to make things work for them when possible.
The interrupt handler for a suspended device should not normally be called, but that can happen in the case of shared interrupts. Any interrupt handler which tries to access a suspended device is likely to run into problems; having every suspend() method release the device's interrupt line can help to avoid this situation.
On resume, interrupts for a device whose driver has not yet been resumed may be seen as spurious and shut down. If that interrupt line is shared, however, other devices could be affected. This problem can be avoided by having ACPI shut down the interrupt altogether until individual drivers restore it, but that depends on drivers explicitly reallocating their interrupt lines.
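
The shared-interrupt case above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every name in it (model_request_irq(), model_free_irq(), model_raise_irq(), and so on) is hypothetical and stands in for the corresponding kernel concept; none of it is kernel code. It shows why a handler left registered across suspend can end up touching a powered-down device when another device on the same line interrupts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel APIs. */
#define MAX_HANDLERS 4

struct model_device {
    const char *name;
    bool suspended;    /* true while the device is powered down */
    int handled;       /* interrupts serviced while awake */
    int bad_accesses;  /* times the handler ran on a suspended device */
};

typedef void (*model_handler_t)(struct model_device *dev);

struct model_irq_line {
    model_handler_t handler[MAX_HANDLERS];
    struct model_device *dev[MAX_HANDLERS];
};

/* Register a handler on the shared line; 0 on success, -1 if full. */
int model_request_irq(struct model_irq_line *line, model_handler_t h,
                      struct model_device *dev)
{
    for (size_t i = 0; i < MAX_HANDLERS; i++) {
        if (line->handler[i] == NULL) {
            line->handler[i] = h;
            line->dev[i] = dev;
            return 0;
        }
    }
    return -1;
}

/* Remove any handler registered for dev. */
void model_free_irq(struct model_irq_line *line, struct model_device *dev)
{
    for (size_t i = 0; i < MAX_HANDLERS; i++) {
        if (line->dev[i] == dev) {
            line->handler[i] = NULL;
            line->dev[i] = NULL;
        }
    }
}

/* A shared line cannot say which device fired, so every handler runs. */
void model_raise_irq(struct model_irq_line *line)
{
    for (size_t i = 0; i < MAX_HANDLERS; i++) {
        if (line->handler[i] != NULL) {
            assert(line->dev[i] != NULL);
            line->handler[i](line->dev[i]);
        }
    }
}

/* A handler that "reads hardware"; on a suspended device that is a bug. */
void model_handler(struct model_device *dev)
{
    if (dev->suspended)
        dev->bad_accesses++;
    else
        dev->handled++;
}

/* Suspend the device, optionally releasing its interrupt line first. */
void model_suspend(struct model_irq_line *line, struct model_device *dev,
                   bool release_irq)
{
    if (release_irq)
        model_free_irq(line, dev);
    dev->suspended = true;
}

/* Wake the device and (re)register its handler exactly once. */
int model_resume(struct model_irq_line *line, struct model_device *dev)
{
    dev->suspended = false;
    model_free_irq(line, dev);  /* drop a stale registration, if any */
    return model_request_irq(line, model_handler, dev);
}
```

In this model, suspending a device with release_irq set to false and then raising the line increments bad_accesses for that device; with release_irq set to true, only the handlers of devices that are still awake run.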

The problem with the ACPI change is that it breaks a large number of drivers and, as a result, breaks suspend on systems where it used to work. The power management hackers seem to see this situation as an unfortunate but necessary step toward getting suspend working reliably on a much broader range of hardware. Having individual drivers release and reacquire their interrupts is also seen as necessary to support runtime power management: suspending individual devices in a running system to save power. The ACPI change, it is said, fixes more systems than it breaks, and is thus worthwhile.

Linus disagreed and reverted the patch, saying:

The thing is, we're better off making very very slow progress that is _steady_, than having people who _used_ to have things work for them suddenly break.

So I believe that if we fix two machines and break one machine, we've actually regressed. It doesn't matter that we fixed more than we broke: we _still_ regressed. Because it means that people can't trust the progress we make!

The right solution, according to Linus, is to go ahead and add the free_irq() and request_irq() calls to individual drivers when it makes sense to do so, and when it does not break things for individual users. Meanwhile, however, the ACPI subsystem should still restore the interrupt state on resume so that unmodified drivers do not break. There are some remaining issues with how that is done: it may involve running the ACPI AML interpreter with interrupts disabled, which leads to a number of interesting situations. Benjamin Herrenschmidt also pointed out that it could lead to situations where drivers may not be able to receive interrupts during the resume process.
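
The difference between the reverted change and the compromise described above can be sketched with another small userspace model. Again, this is an illustration under assumptions, not the actual ACPI code: all names (struct model_link, model_core_resume(), and the rest) are hypothetical, and the model simplifies matters by treating a driver's interrupt request as the event that re-enables the link. It shows that an unmodified driver keeps receiving interrupts only if the core restores routing on resume, while a converted driver works under either policy.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: hypothetical names, not kernel or ACPI APIs. */
enum model_policy {
    MODEL_LINK_LEFT_DISABLED, /* reverted change: drivers must re-request */
    MODEL_LINK_RESTORED       /* compromise: core restores routing on resume */
};

struct model_link {
    bool routed;          /* link currently routes interrupts to the CPU */
    bool handler_present; /* a driver has a handler registered */
    int delivered;        /* interrupts that reached a handler */
};

/* Driver side: requesting the interrupt also re-enables the link
 * (a simplifying assumption of this model). */
void model_driver_request_irq(struct model_link *l)
{
    l->handler_present = true;
    l->routed = true;
}

/* Driver side: release the interrupt; only valid if one is held. */
void model_driver_free_irq(struct model_link *l)
{
    assert(l->handler_present);
    l->handler_present = false;
}

/* Core side: break the interrupt link when suspending. */
void model_core_suspend(struct model_link *l)
{
    l->routed = false;
}

/* Core side: on resume, restore routing only under the compromise policy. */
void model_core_resume(struct model_link *l, enum model_policy p)
{
    if (p == MODEL_LINK_RESTORED)
        l->routed = true;
}

/* Device raises an interrupt; returns true if a handler received it. */
bool model_device_interrupt(struct model_link *l)
{
    if (!l->routed || !l->handler_present)
        return false;
    l->delivered++;
    return true;
}
```

A driver that never calls model_driver_free_irq() or model_driver_request_irq() around suspend stops receiving interrupts after model_core_resume() with MODEL_LINK_LEFT_DISABLED, but continues to work with MODEL_LINK_RESTORED. A converted driver that re-requests its interrupt on resume is unaffected by the policy.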

Eventually, one assumes, these details will be worked out. In the meantime, it will be interesting to see if the "revert any patch that breaks somebody's machine" policy holds. If it leads to a more stable experience for Linux users, it seems like it would be a good thing.

ACPI, device interrupts, and suspend states

Posted Aug 4, 2005 9:55 UTC (Thu) by NAR (subscriber, #1313)

So this could be the reason for those "IRQ 11 nobody cared" (or something like this) error messages I got with an accompanying stack trace... As always, LWN's Kernel Page is worth every cent of my subscription fee.

Bye, NAR

ACPI, device interrupts, and suspend states

Posted Aug 5, 2005 19:03 UTC (Fri) by zblaxell (subscriber, #26385)

There are ways to make progress without so much disruption. Believe it or not, people do notice when the kernel starts spitting out printk messages that it didn't spit out before, especially if those messages describe what has to be changed, or who has to be notified, and if those messages are triggered many times by doing something routine that the user relies on.

Something along the lines of "driver XXX didn't call free_irq before suspend, please fix it" would be nice. I'd even sift through a bunch of "driver XXX _did_ call request_irq after resume" messages to ensure that all the devices in my system were accounted for, and maybe even fix the ones that aren't.

I'd be willing to run the latest kernel on most of my less essential machines, if I could expect the latest kernel to discover and complain about expected future breakage in my specific circumstances, and if I can expect most of the stuff that worked in the previous kernel to work in the next.

On the other hand, if the kernel developers' approach to API and policy changes is to just commit a patch and let everyone else catch up with fixes to things that--although possibly fragile and ugly--were not actually broken in the first place, then I'll only run those kernels when they're well and truly finished (i.e. never), or when I have absolutely no alternative.

ACPI, device interrupts, and suspend states

Posted Sep 9, 2005 1:50 UTC (Fri) by mmarq (guest, #2332)

"In the meantime, it will be interesting to see if the "revert any patch that breaks somebody's machine" policy holds. If it leads to a more stable experience for Linux users, it seems like it would be a good thing."

Sorry... but to me, and perhaps to 90% of users out there, from server to desktop, that means: don't change kernels at all until the new one is proven reliable and safe. I'm not trying to annoy anyone, but it still seems to me that there is a slight disconnect between the development world and the user world.

More and better modularity would address some issues; until then, it is one whole kernel or another... no patches in the middle.
