http://lwn.net/Articles/274008/

A new suspend/hibernate infrastructure

By Jonathan Corbet
March 19, 2008
While attending conferences, your editor has, for some years, made a point of seeing just how many other attendees have some sort of suspend and resume functionality working on their laptops. There is, after all, obvious value in being able to sit down in a lecture hall, open the lid, and immediately start heckling the speaker via IRC without having to wait for the entire bootstrap sequence to unfold. But, regardless of whether one is talking about suspend-to-RAM ("suspend") or suspend-to-disk ("hibernation"), there are surprisingly few people using this capability. Despite the efforts which have been made by developers and distributors, suspend and hibernate still just do not work reliably for a lot of people.

For your editor, suspend always works, but the success rate of the resume operation is about 95% - just enough to keep using it while inspiring a fair amount of profanity in inopportune places.

Various approaches to fixing suspend and hibernation have been proposed; these include TuxOnIce and kexec jump. Another possibility, though, is to simply fix the code which is in the kernel now. There is a lot that has to be done to make that goal a reality, including making the whole process more robust and separating the suspend and hibernation cases which, as Linus has stated rather strongly several times, are really two different problems. To that end, Rafael Wysocki has posted a new suspend and hibernation infrastructure for devices which has the potential to improve the situation - but at the cost of creating no fewer than 20 separate device callbacks.

For the (relatively) simple suspend case, there are four basic callbacks which should be provided in the new pm_ops structure by each bus and, eventually, by every device:

    int (*prepare)(struct device *dev);
    int (*suspend)(struct device *dev);

    int (*resume)(struct device *dev);
    void (*complete)(struct device *dev);

When the system is suspending, each device will first see a call to its prepare() callback. This call can be seen as a sort of warning that the suspend is coming, and that any necessary preparation work should be done. This work includes preventing the addition of any new child devices and anything which might require the involvement of user space. Any significant memory allocations should also be done at this time; the system is still functional at this point and, if necessary, I/O can be performed to make memory available. What should not happen in prepare() is actually putting the device into a low-power state; it needs to remain functional and available.

As usual, a return value of zero indicates that the preparation was successful, while a negative error code indicates failure. In cases where the failure is temporary (a race with the addition of a new child device is one possibility), the callback should return -EAGAIN, which will cause a repeat attempt later in the process.
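The return-value convention can be illustrated with a small userspace mock. This is only a sketch: the struct device below is a stand-in rather than the kernel's structure, and its fields, along with the name mock_prepare(), are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace stand-in for struct device; every field is hypothetical. */
struct device {
    bool child_add_in_progress;  /* a child device is being registered */
    bool accepts_children;       /* cleared once prepare() has succeeded */
    bool powered;                /* prepare() must leave this set */
    void *suspend_buf;           /* memory that suspend() will need later */
    size_t suspend_buf_size;
};

/* Returns 0 on success, -EAGAIN on a temporary failure, -ENOMEM otherwise. */
static int mock_prepare(struct device *dev)
{
    assert(dev != NULL);

    if (dev->child_add_in_progress)
        return -EAGAIN;          /* racing with a new child: ask for a retry */

    /* Allocate now, while the system is still fully functional. */
    dev->suspend_buf = malloc(dev->suspend_buf_size);
    if (!dev->suspend_buf)
        return -ENOMEM;

    dev->accepts_children = false;   /* no new children from here on */
    /* No power-state change here: the device must remain usable. */
    return 0;
}
```

A caller that sees -EAGAIN would simply try again later; any other negative value means the preparation failed.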

At a later point, suspend() will be called to actually power down the device. With the current patch, each device will see a prepare() call quickly followed by suspend(). Future versions are likely to change things so that all devices get a prepare() call before any of them are suspended; that way, even the last prepare() callback can count on the availability of a fully-functioning system.

The resume process calls resume() to wake the device up, restore it to its previous state, and generally make it ready to operate. Once the resume process is done, complete() is called to clean up anything left over from prepare(). A call to complete() could also be made directly after prepare() (without an intervening suspend) if the suspend process fails somewhere else in the system.
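The ordering described above can be sketched for a single device as follows. This is a userspace mock, not the posted patch: struct device is a stand-in, the trace buffer, the demo_* callbacks, mock_suspend_cycle() and its failed_elsewhere flag are all hypothetical, and the handling of a failing prepare() or suspend() is an assumption, since the article does not describe it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct device { int id; };   /* userspace stand-in, not the kernel's type */

/* The four suspend-related callbacks named in the article. */
struct pm_ops {
    int  (*prepare)(struct device *dev);
    int  (*suspend)(struct device *dev);
    int  (*resume)(struct device *dev);
    void (*complete)(struct device *dev);
};

/* Hypothetical trace of callback order: P, S, R, C. */
static char trace[16];

static void trace_reset(void) { memset(trace, 0, sizeof(trace)); }

static void record(char c)
{
    size_t len = strlen(trace);
    if (len + 1 < sizeof(trace))
        trace[len] = c;      /* buffer was zeroed, so it stays terminated */
}

static int  demo_prepare(struct device *dev)  { (void)dev; record('P'); return 0; }
static int  demo_suspend(struct device *dev)  { (void)dev; record('S'); return 0; }
static int  demo_resume(struct device *dev)   { (void)dev; record('R'); return 0; }
static void demo_complete(struct device *dev) { (void)dev; record('C'); }

static const struct pm_ops demo_ops = {
    .prepare = demo_prepare, .suspend = demo_suspend,
    .resume = demo_resume,   .complete = demo_complete,
};

/* One suspend/resume cycle for one device. */
static int mock_suspend_cycle(struct device *dev, const struct pm_ops *ops,
                              int failed_elsewhere)
{
    int err;

    assert(dev != NULL && ops != NULL);

    err = ops->prepare(dev);
    if (err)
        return err;          /* assumption: nothing to undo in this case */

    if (failed_elsewhere) {
        ops->complete(dev);  /* complete() directly after prepare() */
        return -EBUSY;       /* hypothetical error code */
    }

    err = ops->suspend(dev);
    if (err) {
        ops->complete(dev);  /* assumption: clean up and give up */
        return err;
    }

    /* ... the system would be asleep here ... */

    err = ops->resume(dev);
    ops->complete(dev);      /* clean up whatever prepare() left behind */
    return err;
}
```

After a normal cycle the trace reads "PSRC"; when the suspend is aborted elsewhere in the system it reads "PC".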

The hibernation process is more complicated, in that there are more intermediate states. In this case, too, the process begins with a call to prepare(). Then calls are made to:

    int (*freeze)(struct device *dev);
    int (*poweroff)(struct device *dev);

The freeze() callback happens before the hibernation image (the system image which is written to persistent store) is created; it should put the device into a quiescent state but leave it operational. Then, after the hibernation image has been saved and another call to prepare() made, poweroff() is called to shut things down.
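The difference between the two callbacks can be shown with a userspace mock. The struct device here is a stand-in, and its two fields and the mock_* function names are hypothetical; a real driver would be manipulating hardware state instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a device; both fields are hypothetical. */
struct device {
    bool queue_running;   /* device is accepting and issuing new I/O */
    bool powered;         /* device is in an operational power state */
};

/* Before the image is created: go quiet, but stay operational. */
static int mock_freeze(struct device *dev)
{
    assert(dev != NULL);
    dev->queue_running = false;   /* quiesce: no new activity */
    /* dev->powered is left alone; the device must remain operational */
    return 0;
}

/* After the image has been saved: actually shut the device down. */
static int mock_poweroff(struct device *dev)
{
    assert(dev != NULL);
    dev->queue_running = false;
    dev->powered = false;
    return 0;
}
```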

When the system is powered back up, the process is reversed through calls to:

    int (*quiesce)(struct device *dev);
    int (*restore)(struct device *dev);

The call to quiesce() will happen early in the resume process, after the hibernation image has been loaded from disk, but before it has been used to recreate the pre-hibernation system's memory. This callback should quiet the device so that memory can be reassembled without being corrupted by device operations. A call to complete() will follow, then a call to restore(), which should put the device back into a fully-functional state. A final complete() call finishes the process.

There are still two more hibernation-related callbacks:

    int (*thaw)(struct device *dev);
    int (*recover)(struct device *dev);

These functions will be called when things go wrong; once again, each of these calls will be followed by a call to complete(). The purpose of thaw() is to undo the work done by freeze() or quiesce(); it should put the device back into a working state. The recover() call will be made if the creation of the hibernation image fails, or if restoring from that image fails; its job is to clean up and get the hardware back into an operating state.
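A sketch of that error path, again as a userspace mock: the struct device fields, the mock_* callbacks, and the helper mock_undo_then_complete() are hypothetical, and the code assumes that prepare() and freeze() have already succeeded for the device in question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a device; both fields are hypothetical. */
struct device {
    bool queue_running;   /* cleared by freeze() or quiesce() */
    bool prepared;        /* set by prepare(), cleared by complete() */
};

/* Undo freeze() or quiesce(): put the device back to work. */
static int mock_thaw(struct device *dev)
{
    dev->queue_running = true;
    return 0;
}

/* Clean up after a failed image creation or restore. */
static int mock_recover(struct device *dev)
{
    dev->queue_running = true;    /* hardware is operating again */
    return 0;
}

static void mock_complete(struct device *dev)
{
    dev->prepared = false;        /* drop whatever prepare() set up */
}

/* Every thaw() or recover() call is followed by complete(). */
static int mock_undo_then_complete(struct device *dev,
                                   int (*undo)(struct device *),
                                   void (*complete)(struct device *))
{
    int err;

    assert(dev != NULL && undo != NULL && complete != NULL);
    err = undo(dev);
    complete(dev);
    return err;
}
```

The same helper serves both callbacks: pass mock_thaw to undo a freeze() or quiesce(), or mock_recover when creating or restoring the image has failed.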

For added fun, there are actually two sets of pm_ops callbacks. One is for normal system operation, but there is another set intended to be called when interrupts are disabled and only one CPU is operational - just before the system goes down or just after it comes back up. Clearly, interactions with devices will be different in such an environment, so different callbacks make sense. But the result is that fully 20 callbacks must be provided for full suspend and hibernate functionality. These callbacks have been added to the bus_type structure as:

    struct pm_ops *pm;
    struct pm_ops *pm_noirq;

Fields with the same names have also been added to the pci_driver structure, allowing each device driver to add its own version of these callbacks. For now, the old PCI driver suspend() and resume() callbacks will be used if the pm_ops structures have not been provided, and no drivers have been converted (at least in the patch as posted).
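The fallback to the old callbacks might look something like the following userspace sketch. The pm_ops layout simply collects the ten callbacks named in this article (two such tables account for the 20); struct mock_pci_driver is a simplified, hypothetical stand-in for pci_driver whose legacy callbacks do not have the real signatures, and mock_pci_suspend() is not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in device: 0 = not suspended, 1 = legacy path, 2 = pm_ops path. */
struct device { int suspended_via; };

/* The ten callbacks named in the article. */
struct pm_ops {
    int  (*prepare)(struct device *dev);
    int  (*suspend)(struct device *dev);
    int  (*resume)(struct device *dev);
    void (*complete)(struct device *dev);
    int  (*freeze)(struct device *dev);
    int  (*poweroff)(struct device *dev);
    int  (*quiesce)(struct device *dev);
    int  (*restore)(struct device *dev);
    int  (*thaw)(struct device *dev);
    int  (*recover)(struct device *dev);
};

/* Simplified, hypothetical stand-in for struct pci_driver. */
struct mock_pci_driver {
    int (*legacy_suspend)(struct device *dev);
    int (*legacy_resume)(struct device *dev);
    struct pm_ops *pm;        /* normal operation */
    struct pm_ops *pm_noirq;  /* interrupts off, one CPU running */
};

static int old_style_suspend(struct device *dev)
{
    dev->suspended_via = 1;
    return 0;
}

static int new_style_suspend(struct device *dev)
{
    dev->suspended_via = 2;
    return 0;
}

/* Use pm_ops when the driver provides it, the legacy callback otherwise. */
static int mock_pci_suspend(struct device *dev,
                            const struct mock_pci_driver *drv)
{
    assert(dev != NULL && drv != NULL);

    if (drv->pm)
        return drv->pm->suspend ? drv->pm->suspend(dev) : 0;
    return drv->legacy_suspend ? drv->legacy_suspend(dev) : 0;
}
```

An unconverted driver leaves pm set to NULL and keeps working through its old callback; a converted one fills in pm and the old callback is ignored.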

As of this writing, discussion of the patch is hampered by an outage at vger.kernel.org. There are some concerns, though, and things are likely to change in future revisions. Among other things, the number of "no IRQ" callbacks may be reduced. But, with luck, the final resolution will leave us all in a position where suspend and hibernate work reliably.


Hibernation and S4 Grr

Posted Mar 20, 2008 6:27 UTC (Thu) by ebiederm (subscriber, #35028)

Currently this hibernation solution is overcomplicated.  It allows for using the ACPI S4
state, which is a low-power state potentially using slightly more power than soft off.  ACPI
S4 allows the hibernating kernel to control, in a fine-grained manner, which devices are
sufficiently alive to wake up the machine.  That is great, but it is something we should worry
about after we get a solid hibernation scheme working.

If you don't worry about ACPI S4, hibernation is much simpler, as all that is really required
of device drivers is stopping their queues and disconnecting from a device.

Then, when the image is restored, all you have to do is reconnect the driver to the device.

That is, only the:
 int (*freeze)(struct device *dev);
 int (*restore)(struct device *dev);
methods of the proposed interface appear necessary.

Transitioning to ACPI S4 (or ACPI S5 soft off) after we save the image appears to be all that
is necessary.


I think the conversation that is starting with pm_ops is a good one.
But I really hope we look carefully at what we are asking the device drivers to do and see if
we can come up with something simple and straightforward for them to implement and maintain.

We have a lot of similarity in the hibernation ops, the hotplug ops, the driver load and
unload ops, and the reboot shutdown ops.  It would be cool if we could identify some key
functionality that we are performing and reduce the work that a driver author needs to do, to
test and implement the driver.




A new suspend/hibernate infrastructure

Posted Mar 20, 2008 8:23 UTC (Thu) by AndyBurns (subscriber, #27521)

I know a name is only a name, but the "quiesce" call is something I'd expect when the system
was on the way down, not on the way back up.


A new suspend/hibernate infrastructure

Posted Mar 20, 2008 10:25 UTC (Thu) by tialaramex (subscriber, #21167)

I've had more or less working suspend & hibernate on this Z60m for some time now, and before
that on an X31.

Both machines have suffered some problems from kernel bugs or (more rarely but often longer
lasting) userspace trouble. In fact the X31's trouble was dominated by a faulty lid switch
(poor design by IBM) which took to declaring that the lid had been opened while the laptop was
in fact closed and supposedly asleep, waking the machine while inside bags or being carried
about and creating a fire hazard. This bug, at least, can't be laid at the door of any Free
Software developers.

The most serious bug in the last six months was losing ACLs on my audio devices after a
restore. This was found to be a race condition in userspace software and fixed, but not
without months of annoyance.

But always it comes back to this: Suspending, especially to RAM, is merely a convenience. So
it has to be /really/ reliable to be worth having. A lot of the time, my laptops weren't in
the state where I could claim that, but it has definitely been getting better.

Oh, and to address the initial statistic offered, it occurs to me that a lot of people might
be just one or two drivers from working suspend. So the effort to get from say 10% of
attendees having working suspend to 90% may actually just be concentrated in one or two key
places. The nVidia drivers, of all sorts, seem to be notoriously twitchy about suspending.

A new suspend/hibernate infrastructure

Posted Mar 20, 2008 13:39 UTC (Thu) by nescafe (subscriber, #45063)

If you are running a fairly recent nvidia binary driver (100.x.x or higher), most of the
flakiness is attributable to the quirks that the suspend/resume infrastructure runs, which
duplicate (badly) or race with the tasks the nvidia driver performs.  Once I got rid of those,
suspend/resume worked great on my system.

A new suspend/hibernate infrastructure

Posted Mar 20, 2008 16:17 UTC (Thu) by ebiederm (subscriber, #35028)

The fact that we confuse the driver methods for putting the hardware in a low power state
(suspending) and the driver methods for quiescing a device and being prepared for it to
disappear results in unfixable infrastructure bugs and unclear semantics.

So the current set of suspend/resume and hibernate operations must be
reexamined if we are to have something that is sane, and generally implementable.

Just a few more drivers is a nice idea.  But the driver authors can't do that if we don't have
clear expectations of what the functions they are supposed to implement should do.

In fact, except for some goofy corner cases, we should be able to get away with no operations
for hibernation.  The fact that the latest proposal has more driver methods for hibernation
shows that the design, while it may be sane from the infrastructure point of view, still has a
ways to go before it makes sense from a driver and maintenance point of view.




A new suspend/hibernate infrastructure

Posted Mar 20, 2008 22:18 UTC (Thu) by iabervon (subscriber, #722)

In order to really make sense, the hibernate description needs to note that the restoration
process works by booting the system into a state that's able to load the stored state and then
transfer control to it.

The reason for "quiesce" is that the drivers used in the temporary system have to finish up
what they're doing before they get replaced with the loaded image and put the devices into a
state such that the drivers from the saved image (which have no idea what the temporary system
did) can get them working again in some sane fashion. That is, the kernel image that the user
cares about never calls quiesce() at all, either before or after shutting down.

A new suspend/hibernate infrastructure

Posted Jul 25, 2008 5:46 UTC (Fri) by nikanth (subscriber, #50093)

The effort being put into suspend seems to have helped me, at least. Both s2r and s2d work for
me on openSUSE 11.0 on a T60p.

