http://lwn.net/Articles/172149/

Robust futexes - a new approach

There is one problem with keeping the kernel out of the picture, however. If a process comes to an untimely end while holding a futex, there is no way to release that futex and let other processes know about the problem. The SYSV semaphore mechanism - a much more heavyweight facility - has an "undo" feature which can be called into play in this sort of situation, but there is no such provision for futexes. As a result, a few different "robust futex" patches have been put together over the years; LWN looked at one of them in January 2004. These patches have tended to greatly increase the cost of futexes, however, and none have been accepted into the mainline. Ingo Molnar, working with Thomas Gleixner and Ulrich Drepper, has tossed aside those years' worth of work and, in a couple of days, produced a new robust futex patch which, he hopes, will find its way into the mainline. The new patch has the advantage of being fast, but, as Ingo notes:

    Be warned though - the patchset does things we normally dont do in Linux, so some might find the approach disturbing. Parental advice recommended ;-)
The fundamental problem to solve is that the kernel must, somehow, know about all futexes held by an exiting process in order to release them. A past solution has been the addition of a system call to notify the kernel of lock acquisitions and releases. That approach defeats one of the main features of futexes - their speed. It also adds a record-keeping and resource-limiting burden to the kernel, and suffers from some problematic race conditions.

So Ingo's patch takes a different approach. A list of held futexes is maintained for each thread, but that list lives in user space. All the thread has to do is make a single call to a new system call:

    long set_robust_list(struct robust_list_head *head, size_t size);

That call informs the kernel of the location of a linked list of held futexes in the calling process's address space; there is also a get_robust_list() call for retrieving that information. Typically, this call would be made by glibc and never seen by the application. Glibc would also take on the task of maintaining the list of held futexes.

When a process dies, the kernel looks for a pointer to a user-space futex list. Should that pointer be found, the kernel will carefully walk through the list, bearing in mind that, as a user-space data structure, it could be accidentally or maliciously corrupted. For each held futex, the kernel will release the lock and set it to a special value indicating that the previous holder made a less-than-graceful exit. It will then wake a waiting process, if one exists. That process will be able to see that it has obtained the lock under dubious circumstances (user-space functions like pthread_mutex_lock() are able to return that information) and take whatever action it deems necessary. The kernel will release a maximum of one million locks; that limit keeps it from looping forever on a circular list. Given the practical difficulties of making a million-lock application work at all, the limit should not constrain anybody for quite some time.

There is still a race condition here: if a process dies between the time it acquires a lock and the time it updates the list, that lock might not be released by the kernel. Getting around that problem involves a bit of poor kernel hacker's journaling. The head of the held-futex list contains a single-entry field which can be used to point to a lock which the application is about to acquire. The kernel checks that field on exit and, if it points to a lock actually held by the application, releases that lock along with the others. So, if glibc sets that field before acquiring a lock (and clears it once the list has been updated), all locks held by the application will be covered.

The discussion on this patch was just beginning when this article was written. There is some concern about having the kernel walk through user-space data structures; the chances of trouble and security problems are certainly higher when that is going on. Other issues may yet come up as well. But, since this is clearly not a 2.6.16 feature in any case, there will be time to talk about them.
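In the meantime, a rough sketch may make the user-space side of the scheme more concrete. The structure layout below follows the interface that eventually appeared in the mainline <linux/futex.h>; the early patch set described above may differ in its details, and names like robust_mutex and register_robust_list() are purely illustrative.

    /*
     * Per-thread registration of the robust futex list, roughly as glibc
     * would do it.  Illustrative only - not the actual glibc code.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct robust_list {
        struct robust_list *next;
    };

    struct robust_list_head {
        struct robust_list list;              /* circular list of held locks */
        long futex_offset;                    /* list entry -> futex word    */
        struct robust_list *list_op_pending;  /* the "journal" slot          */
    };

    /* One possible user-space lock layout: list entry plus 32-bit futex word. */
    struct robust_mutex {
        struct robust_list list;
        uint32_t futex;    /* 0 = free, otherwise the owner's TID (+ flags) */
    };

    static __thread struct robust_list_head robust_head;

    static void register_robust_list(void)
    {
        robust_head.list.next = &robust_head.list;   /* empty circular list */
        robust_head.futex_offset = offsetof(struct robust_mutex, futex)
                                 - offsetof(struct robust_mutex, list);
        robust_head.list_op_pending = NULL;
        syscall(SYS_set_robust_list, &robust_head, sizeof(robust_head));
    }

The futex_offset field tells the kernel where, relative to each list entry, the futex word itself lives, so the entries can be embedded in whatever lock structure user space happens to use.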
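Continuing that sketch, the journaling trick shows up in the lock path: the list_op_pending slot is filled in before the futex word is touched, and cleared only once the newly acquired lock is safely on the list. The FUTEX_OWNER_DIED bit used below is the "special value" the mainline futex ABI settled on for a lock whose holder died; the code is a deliberate simplification (no waiter handling, no FUTEX_WAIT slow path), not the real glibc fast path.

    /* Builds on the declarations in the previous listing. */
    #define FUTEX_OWNER_DIED  0x40000000  /* set by the kernel's exit cleanup */

    /* Returns 1 if the lock was inherited from a dead owner, 0 on a normal
     * acquisition, and -1 if the lock is simply held by somebody else. */
    static int robust_trylock(struct robust_mutex *m, uint32_t my_tid)
    {
        uint32_t expected = 0;
        int owner_died = 0;

        /* Step 1: journal the operation.  If this thread dies after the
         * compare-and-swap but before the list insertion, the kernel will
         * still find the lock through list_op_pending. */
        robust_head.list_op_pending = &m->list;

        /* Step 2: try to put our TID into the futex word. */
        if (!__atomic_compare_exchange_n(&m->futex, &expected, my_tid, 0,
                                         __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            if (expected & FUTEX_OWNER_DIED) {
                /* The kernel cleaned up after a dead owner; try to claim
                 * the lock ourselves. */
                if (__atomic_compare_exchange_n(&m->futex, &expected, my_tid,
                                                0, __ATOMIC_ACQUIRE,
                                                __ATOMIC_RELAXED)) {
                    owner_died = 1;
                } else {
                    robust_head.list_op_pending = NULL;
                    return -1;          /* another waiter got there first */
                }
            } else {
                robust_head.list_op_pending = NULL;
                return -1;              /* held; a real lock would FUTEX_WAIT */
            }
        }

        /* Step 3: the lock is ours - link it into the per-thread list,
         * then retire the journal entry. */
        m->list.next = robust_head.list.next;
        robust_head.list.next = &m->list;
        robust_head.list_op_pending = NULL;
        return owner_died;
    }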
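Finally, the "dubious circumstances" seen by the next locker eventually surfaced at the application level as the EOWNERDEAD error from pthread_mutex_lock(). The example below uses the robust-mutex calls as later standardized in POSIX.1-2008 (glibc initially offered them as _np-suffixed extensions); it is an independent, compilable illustration of the recovery pattern, not something taken from the patch itself.

    /* Recovering from a dead lock holder with a robust pthread mutex.
     * Compile with -pthread. */
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    static int lock_robustly(pthread_mutex_t *m)
    {
        int err = pthread_mutex_lock(m);

        if (err == EOWNERDEAD) {
            /* We hold the lock, but the previous owner died with it held.
             * Repair whatever state the lock protects, then mark the mutex
             * consistent so later lockers see a normal mutex again. */
            fprintf(stderr, "previous owner died; recovering state\n");
            pthread_mutex_consistent(m);
            err = 0;
        }
        return err;
    }

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t m;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        pthread_mutex_init(&m, &attr);

        if (lock_robustly(&m) == 0)
            pthread_mutex_unlock(&m);

        pthread_mutex_destroy(&m);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }

If recovery is not possible, the application can instead unlock without calling pthread_mutex_consistent(), after which the mutex becomes permanently unusable and further lock attempts return ENOTRECOVERABLE.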