[Posted September 26, 2005 by corbet]

Suspend-to-disk is a feature desired by many Linux users; both laptop and desktop users can benefit from being able to save the state of the system to a local drive and, after a reboot, find everything as they left it. The current in-kernel suspend mechanism works for many, but not everybody is comfortable with the large amount of invasive code required. The out-of-tree suspend2 implementation adds quite a few worthwhile features, but at the cost of expanding the software suspend implementation still further. Concern over putting some of the suspend2 features into the kernel has been one of the factors preventing its merging so far.

Pavel Machek, the maintainer of the in-kernel suspend implementation, has now complicated the picture with the swsusp3 patch, which moves some of the work of suspending the system into user space. This code is said to work; if this approach continues to show promise, it could point the way toward adding suspend2's features without growing the kernel.

The software suspend process, in very rough terms, works like this:

1. All processes on the system (with a few exceptions) are put into a special "frozen" state.
2. Any memory which has on-disk backing store is forced out to disk; this step essentially clears the system of all user-space pages. Any kernel memory which can be done without - caches and such - is also dropped.
3. Any remaining memory which is not in reserved space (not part of the kernel text, for all practical purposes) is written to a suspend image on the disk. Also written is a map saying where the pages came from in the first place.
4. The system is shut down.

When the system is resumed, these steps are performed in reverse order, except that user-space memory remains on disk until faulted in by the newly-restarted system.

The swsusp3 patch does not move all of the above work to user space; much of it must be done in the kernel. What does move is step 3, the writing of kernel memory to disk. This operation is handled by way of /dev/kmem. To that end, the swsusp3 patch adds a set of scary ioctl() calls to the /dev/kmem driver.

The new user-space suspend program begins by locking itself into memory. This step is required - it would not do for it to change the memory state in the middle of the process via page faults. A call to the new IOCTL_FREEZE operation on /dev/kmem performs the first two steps listed above: freezing processes and clearing memory. The IOCTL_ATOMIC_SNAPSHOT call then puts devices on hold and creates an in-kernel list of pages which must be saved.
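Locking a process into memory is something user space can already do with the standard mlockall() call. The sketch below shows just that one step, driven from Python through ctypes. Whether the swsusp3 tool does it this way is an assumption; the helper names are hypothetical, and the MCL_* values are the ones used on most Linux architectures (a few, such as PowerPC, use different numbers).

```python
import ctypes
import os

# Flag values from <sys/mman.h> on most Linux architectures (x86, ARM).
# Assumption: a few architectures define different numbers.
MCL_CURRENT = 1  # lock every page that is mapped right now
MCL_FUTURE = 2   # also lock pages that get mapped later


def lock_process_memory():
    """Try to pin the whole process in RAM; return True on success.

    Hypothetical helper. It returns False when the caller lacks the
    privilege or a large enough RLIMIT_MEMLOCK.
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        rc = libc.mlockall(MCL_CURRENT | MCL_FUTURE)
    except (OSError, AttributeError, TypeError):
        return False  # no usable libc, or no mlockall() on this platform
    if rc != 0:
        print("mlockall failed:", os.strerror(ctypes.get_errno()))
        return False
    return True


def unlock_process_memory():
    """Undo lock_process_memory(); harmless if nothing was locked."""
    try:
        ctypes.CDLL(None).munlockall()
    except (OSError, AttributeError, TypeError):
        pass
```

A suspend tool would presumably refuse to continue if lock_process_memory() returned False, since a page fault in the middle of the snapshot is exactly what this step is meant to rule out.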

The ioctl(/dev/kmem, IOCTL_ATOMIC_SNAPSHOT) call returns a pointer to that list of pages. The user-space program can then obtain the list (by reading it from /dev/kmem) and pass through it. Each page on the list is read from kernel memory and written to the suspend image file. Finally, the list itself is written to the suspend image. Once that is done, the system can be powered down.
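The user-space side of that loop is simple enough to model. The sketch below uses any seekable binary file object as a stand-in for /dev/kmem and a plain Python list in place of the page list returned by IOCTL_ATOMIC_SNAPSHOT. The image layout (pages first, then the addresses, then a count) and the function names are invented for illustration and are not the format swsusp3 uses.

```python
import io
import struct

PAGE_SIZE = 4096  # assumption: 4K pages, as on x86


def save_pages(mem, image, page_addrs):
    """Copy each listed page from mem into image, then append the list.

    mem stands in for an open /dev/kmem; page_addrs plays the role of
    the in-kernel page list. Both are assumptions made for this sketch.
    """
    for addr in page_addrs:
        mem.seek(addr)
        page = mem.read(PAGE_SIZE)
        if len(page) != PAGE_SIZE:
            raise IOError("short read at %#x" % addr)
        image.write(page)
    # The list itself goes last, as described above: every address as a
    # little-endian 64-bit value, followed by the number of entries.
    for addr in page_addrs:
        image.write(struct.pack("<Q", addr))
    image.write(struct.pack("<Q", len(page_addrs)))


def load_page_list(image_bytes):
    """Recover the address list from the trailer written by save_pages()."""
    (count,) = struct.unpack("<Q", image_bytes[-8:])
    start = len(image_bytes) - 8 - 8 * count
    return list(struct.unpack("<%dQ" % count, image_bytes[start:-8]))
```

Passing io.BytesIO objects for both mem and image exercises the loop without touching real kernel memory. Saved pages appear in the image in list order, so the n-th address in the trailer says where the n-th page came from.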

The resume process writes the saved image back into kernel memory. It has the additional problem, however, of having to deal with two kernels at once. This process will be running under a freshly-booted kernel (the "resume kernel") with its own idea of the state of the world; that state will eventually be overwritten by the state from the suspended kernel, but that step must be handled carefully. The resume process cannot simply overwrite arbitrary kernel memory, since it is counting on the resume kernel to continue to function until all of the suspended kernel's memory has been read in. So the user-space resume process must be able to allocate pages in kernel space.

The answer is, of course, another ioctl() command, IOCTL_KMALLOC, which executes a get_zeroed_page() call and returns the address of the resulting page to user space. Once a full set of pages has been loaded with the suspended kernel's memory, an updated page map can be stored in the kernel, and an IOCTL_ATOMIC_RESTORE operation tells the resume kernel to finish the process.
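The two-kernel dance can be illustrated with a toy model. In the sketch below, a bytearray plays the resume kernel's memory, kmalloc_page() plays the role of IOCTL_KMALLOC, and atomic_restore() plays the role of IOCTL_ATOMIC_RESTORE. The class and function names are invented for illustration and nothing here is real kernel code. The point is only that saved pages are staged in freshly allocated pages, so the resume kernel's own pages stay intact until the final step.

```python
PAGE_SIZE = 4096  # assumption: 4K pages


class ResumeKernelModel:
    """Toy stand-in for the resume kernel's memory (hypothetical)."""

    def __init__(self, num_pages, in_use):
        self.memory = bytearray(num_pages * PAGE_SIZE)
        self.in_use = set(in_use)  # pages the resume kernel depends on
        self.free = [p for p in range(num_pages) if p not in self.in_use]

    def kmalloc_page(self):
        """Hand out one free page, never one the resume kernel is using."""
        page = self.free.pop(0)
        self.in_use.add(page)
        return page

    def write_page(self, page, data):
        start = page * PAGE_SIZE
        self.memory[start:start + PAGE_SIZE] = data

    def read_page(self, page):
        start = page * PAGE_SIZE
        return bytes(self.memory[start:start + PAGE_SIZE])

    def atomic_restore(self, page_map):
        """Move every staged page to its original location.

        The model copies all staged data out first, which sidesteps the
        overlap handling a real kernel would need.
        """
        staged = {orig: self.read_page(tmp) for orig, tmp in page_map.items()}
        for orig, data in staged.items():
            self.write_page(orig, data)


def load_image(kernel, saved_pages):
    """Stage saved pages in newly allocated pages; return the updated map.

    saved_pages maps original page number -> page contents; the result
    maps original page number -> temporary page number.
    """
    page_map = {}
    for orig, data in saved_pages.items():
        tmp = kernel.kmalloc_page()
        kernel.write_page(tmp, data)
        page_map[orig] = tmp
    return page_map
```

Until atomic_restore() runs, the pages listed in in_use are untouched, which is what lets the resume kernel keep functioning while the image is read in.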

This code is very much in an early stage; even people who do not hesitate to use software suspend may want to be careful with swsusp3 on systems they actually care about resuming. Once things settle down, however, swsusp3 could open the door to a number of features, including graphical progress displays and the ability to interrupt the suspend process, which users have been asking for.

User-space software suspend

Posted Sep 29, 2005 9:13 UTC (Thu) by hawk (subscriber, #3195) [Link]

One of the fairly big benefits of swsusp2 is that it doesn't do away with any memory that can be done away with. Doing so may be ideal from some point of view (probably simplifies stuff), but it is definitely not ideal for the user!

After a suspend/resume cycle with swsusp2 (which is actually slightly quicker than a swsusp1 cycle!) the machine is in the same state as it was before suspending: it still has the running programs in memory, stuff cached, etc.

Swsusp1 may work "just as well" (for me at least), but it puts the system back in a very sorry state, where the system is on the verge of being unusable for some time after resuming.

User-space software suspend

Posted Sep 29, 2005 11:45 UTC (Thu) by rise (subscriber, #5045) [Link]

Good points, though I'd like to note that in my experience a suspend2 suspend & resume cycle is much faster than a swsusp1 cycle even with keeping cache and buffers. Suspend2 also has the option to throw away both, which dramatically speeds up the cycle at the cost of a system that's initially a bit sluggish after resume as it faults everything back in - though no worse than swsusp1.

User-space software suspend

Posted Sep 30, 2005 3:03 UTC (Fri) by zblaxell (subscriber, #26385) [Link]

I do like the fact that swsusp2 resumes with caches and buffers intact. If I wanted to wait while the system painfully restored this data one 4K page fault from swap at a time, I might as well reboot--it could actually be faster.

On the other hand, I generally like to run a small application before suspending, which allocates memory until a few hundred pages are swapped (it is a loop of malloc() and reading paging statistics out of /proc), then exits. This dumps out some of the more useless 400MB or so of caches on my system, and cuts resume time in half (it does add a second or two to suspend), without the extreme pain of having to swap _everything_ back in on resume.

I'm not sure what benefit there is in pushing too much of the suspend and resume functions into user space. After a while we start to need a whole lot of system calls to tell the kernel which of its "user space" processes are in fact absolutely critical to the continued functioning of the kernel, at which point IMHO it would be much simpler, safer, smaller, and swifter to just push the whole thing back into kernel-space. If you combine user-space suspend and resume with user-space block devices, user-space network devices, user-space encryption (on either), user-space device configuration, network storage devices, and device drivers that live partly or entirely in user-space, there's a whole lot of stuff that is just bouncing back and forth between user-space and kernel-space with no really sane reason to do so other than "we don't have to do all of it in the kernel."

In one special case of user space--monolithic user-space applications--there is a similar question of what to include in the main application's space and what to farm out to other processes. Sometimes the monolithic application is even called a "kernel." One solution in common between the Linux kernel and other large applications is to dynamically load code into the application's address space (.ko's or .so's). Another solution is to initiate another process with a separate address space, then communicate with the main application over some kind of IPC (netlink, /proc, /sys, dbus, hotplug, mmap...or sockets, pipes, shared memory, mmap).

There is a third option which is used by big applications but not the Linux kernel: embedded interpreted languages. Modern applications, once they cross a certain size threshold, tend to suddenly sprout a language interpreter to cope with their more advanced configuration options (where "configuration" sometimes amounts to "when I press this button, execute 1500 lines of custom workflow code"). Things like netfilter get close to this--iptables is almost Turing-complete, the chains are analogous to functions, some of the experimental netfilter modules implement dictionary lookups analogous to variables, and the non-experimental modules can do basic boolean logic on packets combining the results from multiple rules, as long as you don't need more than 8 levels of nesting or 32 bits of storage per packet. Netfilter in particular could benefit a lot from having a compiler in user-space generate an optimized (not every netfilter chain entry *needs* to look at the source and destination network/netmask, but they do nonetheless) bytecode (or even machine code) filter configuration, then pushing that code into a much simpler kernel-space implementation. I'm surprised the Linux kernel doesn't have at least one interpreted configuration language, not even as a module--other Unixish kernels and their bootloaders do.

Most of the time, the only advantage I ever see from having things like root filesystem configuration, device mapping, encryption, firmware loading, etc. configured from or provided by user-space is that it is then possible to do non-trivial configurations or experimental implementations. For example, the md-RAID setup allows a number of straightforward RAID configurations to be set up automatically by the kernel, while the device-mapper and other LVM flavors are configured from user-space and can (in theory) be a lot more flexible. Another example is encrypted filesystem setup, where you almost certainly want to have a custom user-space script to retrieve the decryption keys from whatever they're stored on, match them up with the right partitions, and of course collect the passphrase from the console. All this stuff can easily be handled by even a minimal scripting language with the right set of primitives--most of which would just be wrappers around existing kernel code, e.g. open() or sha1().

I currently do this kind of userspace configuration on an initrd with busybox (almost but not quite as painful as custom C code), custom binaries (which are comparatively hard to fix when they break, unless you have the presence of mind to keep a working development environment on a bootable CD with you at all times), or even shell scripts (which work, but take up megabytes of space for the 99% of the code you're not using). IMHO they all suck. The amount of stuff that I have to put into the initrd keeps getting bigger while the amount of stuff in the kernel keeps getting...well, bigger, and yet the amount of stuff that the kernel can do without help from user-space seems to be decreasing with each new major kernel subsystem. Also, I have to go through some weird flaky gymnastics to reconfigure user space (pivot_root and real-root-dev come to mind here) without leaving dangling references to multi-megabytes of initrd crap taking up RAM and swap. I'd rather just put 20K of some simple script language runtime into the kernel, have the kernel read and execute a 4K boot script, and be done with it. It can't take more than that much code to prompt for a password, run it through the appropriate salt and hash functions, set up two loop device AES keys, then exec "/sbin/init".

User-space software suspend

Posted Oct 6, 2005 17:36 UTC (Thu) by peschmae (guest, #32292) [Link]

> I do like the fact that swsusp2 resumes with caches and buffers intact.
> If I wanted to wait while the system painfully restored this data one 4K
> page fault from swap at a time, I might as well reboot--it could actually
> be faster.

Me too. But on my machine (laptop - harddisk is slow) rebooting would still be slower ;-)

> On the other hand, I generally like to run a small application before
> suspending, which allocates memory until a few hundred pages are swapped
> (it is a loop of malloc() and reading paging statistics out of /proc),
> then exits. This dumps out some of the more useless 400MB or so of caches
> on my system, and cuts resume time in half (it does add a second or two
> to suspend), without the extreme pain of having to swap _everything_ back
> in on resume.

Isn't that exactly what the # ImageSizeLimit 200 item in hibernate.conf (or /proc/software_suspend/image_size_limit, respectively) is there for?

Does your way of doing the more or less same thing have an advantage over that? (Faster maybe?)

> I'm not sure what benefit there is in pushing too much of the suspend and
> resume functions into user space.

I agree here. Because it still seems to need very much code in the kernel - only a minimal part is user space application.
And I don't really like it if the kernel depends on user space apps to boot - only makes for trouble (the tool has to be on the initrd (I don't like initrds anyway - at least not for my custom built kernels))

Peschmä

User-space software suspend

Posted Oct 6, 2005 19:11 UTC (Thu) by zblaxell (subscriber, #26385) [Link]

Normally suspend2 writes all non-free pages (including clean cache pages and cached swap pages). This is a bit annoying for me, since 90% of the time I use less than 40% of my laptop's memory, but I have to wait for the other 60% of the RAM to be read and written at suspend and resume time.

ImageSizeLimit is an upper bound on the image size. If the image would be larger than this, then there is a pre-suspend forcing of pages--dirty or not--to disk. If the value is not dynamically chosen, it is inefficient--too high, and unnecessary pages are written in the suspend image; too low, and suspend and resume time is significantly increased since a bunch of stuff has to be swapped out before suspend and back in after resume, and page for page the swapper is much slower than Suspend2's image writer. Dynamically choosing the value is apparently non-trivial...at least I tried to do it for a while, then gave up.

My application forces all the clean pages (600MB as I write this) to go away, without losing active program text pages or forcing dirty pages to swap. It stops as soon as there are more than 100 pages written to swap since the program started running, so it does not significantly extend the suspend time (a few hundred pages are swapped before the application notices and exits, which does take a second or so).

This approach doesn't need prior configuration--it automatically discovers just how much RAM can be cheaply freed by allocating as much as the system can spare without swapping, then it exits and leaves thousands of free pages.
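The stopping rule described here (allocate until more than about 100 pages have gone to swap, then exit) can be sketched as follows. This is an illustration, not the poster's program: the function names are made up, the reader and allocator are injected so the loop can be exercised without real memory pressure, and the 4kB page size is an assumption.

```python
def parse_swapfree_kb(meminfo_text):
    """Return the SwapFree value (in kB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapFree:"):
            return int(line.split()[1])
    raise ValueError("no SwapFree line found")


def eat_until_swapping(read_swapfree_kb, allocate_mb,
                       threshold_pages=100, page_kb=4, max_mb=1 << 20):
    """Allocate 1MB at a time until more than threshold_pages hit swap.

    read_swapfree_kb() returns the current SwapFree value in kB and
    allocate_mb() grabs (and touches) one more megabyte. Returns the
    number of megabytes taken before the loop stopped.
    """
    start_kb = read_swapfree_kb()
    taken = 0
    while taken < max_mb:
        # Pages written to swap since this function started running.
        swapped_pages = (start_kb - read_swapfree_kb()) // page_kb
        if swapped_pages > threshold_pages:
            break
        allocate_mb()
        taken += 1
    return taken
```

In real use, read_swapfree_kb would read /proc/meminfo and hand its contents to parse_swapfree_kb(), while allocate_mb would append something like b"." * (1 << 20) to a list so the memory stays allocated until the process exits.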

Without all the extra pages, the suspend image is much smaller, so suspend and resume are faster. Since only a few dirty or active pages were actually swapped, it doesn't noticeably slow down the machine after resume (there is more overhead when xscreensaver wakes up after noticing the wall clock time jumping well past the inactivity threshold, than there is from post-resume swapping ;-).

User-space software suspend

Posted Oct 30, 2005 1:51 UTC (Sun) by NinjaSeg (subscriber, #33460) [Link]

Errr, care to share it with us?

User-space software suspend

Posted Nov 4, 2005 0:43 UTC (Fri) by zblaxell (subscriber, #26385) [Link]

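#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);

# Return the current SwapFree value (in kB) from /proc/meminfo.
sub swapfree {
    open(my $proc, '<', '/proc/meminfo') or die "open: /proc/meminfo: $!";
    my ($swapfree) = grep(/^SwapFree:/, <$proc>);
    close($proc);
    defined $swapfree or die "no SwapFree line in /proc/meminfo";
    $swapfree =~ s/\D+//g;    # keep only the digits
    print STDERR "swapfree=$swapfree\n";
    return $swapfree;
}

my $last_swapfree = swapfree;
my @blobs;    # holds the allocations so they are not freed early

my $count = 0;
my $total = 0;

my $start_time = time;

# Allocate ever-larger blobs (1MB, 2MB, 3MB, ...) until free swap drops,
# which means the kernel has started paging out.
while ($last_swapfree <= (my $new_swapfree = swapfree)) {
    ++$count;
    push(@blobs, ('.' x (1048576 * $count)));
    $total += $count;
    print STDERR "${total}M allocated\n";
    $last_swapfree = $new_swapfree;
}
system("ps m $$");
print STDERR time - $start_time, " seconds\n";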

User-space software suspend

Posted Sep 30, 2005 16:41 UTC (Fri) by richardfish (guest, #20657) [Link]

Could not agree more! The biggest reason I prefer suspend2 is because it preserves cached memory.

User-space software suspend

Posted Oct 6, 2005 15:31 UTC (Thu) by quintesse (subscriber, #14569) [Link]

How much longer is this going to take? It's the year 2005 for god's sake and Linux still has no perfectly working suspend/hibernate? It's really one of the few things that drives me nuts at times about Linux (re-installing all kernel modules for a new kernel is the other).

NB: But I think I was successful in convincing the maintainer of the ATrpms to include swsusp2-enabled kernels in his repository so hopefully I won't have to worry about swsusp2 anymore in the near future :-)

NB: Now to convince NVidia to make their drivers suspend-compatible!

User-space software suspend

Posted Nov 7, 2005 22:01 UTC (Mon) by lacostej (subscriber, #2760) [Link]

> NB: Now to convince NVidia to make their drivers suspend-compatible!

I sure need that as well. Come on nvidia! I have a 3.5-year-old Dell laptop and suspend to disk never worked!

User-space software suspend

Posted Apr 8, 2006 20:51 UTC (Sat) by lacostej (subscriber, #2760) [Link]

I would like to update my statement.

After upgrading to Ubuntu dapper drake test flight 6 and following this: https://wiki.ubuntu.com/NvidiaLaptopBinaryDriverSuspend

I am finally able to suspend to RAM (and probably disk as well).

I've tested this with the latest nvidia kernel and madwifi (all installed by ubuntu) while on a wireless connection with Skype software on.

Suspended. Waited for 30 seconds. Reopened the machine, tried to call laptop using Skype from another PC, and It Just Worked.

Finally. Too bad the machine (Inspiron 8100) is getting really tired (4 years old, one dead battery, one dead USB, one dead PCMCIA, dead CD/DVD drive). That's without counting the 2 replaced hard disks and 2 replaced motherboards, + the dead keyboard.

Nevertheless it now finally works. I'll open a beer to celebrate.

[linuxkernelnewbies] User-space software suspend [LWN.net]
