|
http://lwn.net/Articles/310739/ Adding a system call to the kernel is never done lightly. It is important to get it right before it gets merged because, once that happens, it must be maintained as part of the kernel's binary interface forever. The proposal to add preadv() and pwritev() system calls provides an excellent example of the kinds of concerns that need to be addressed when adding to the kernel ABI. The two system calls themselves are quite straightforward. Essentially, they combine the existing pread() and readv() calls (along with the write variants of course) into a way to do scatter/gather I/O at a particular offset in the file. Like pread(), the current file position is unaffected. The calls, which are available on various BSD systems, can be used to avoid races between an lseek() call and a read or write. Currently, applications must do some kind of locking to prevent multiple threads from stepping on each other when doing this kind of I/O. The prototypes for the functions look much like readv/writev, simply adding the offset as the final parameter: ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
But, because off_t is a 64-bit quantity, this causes problems
on
some architectures due to the way system call arguments are
passed. After Gerd Hoffmann posted version 2
of the patchset, Matthew Wilcox was quick to point out a problem:
Are these prototypes required? MIPS and PARISC
will need wrappers to
fix them if they are. These two architectures have an ABI which
requires 64-bit arguments to be passed in aligned pairs of registers,
but glibc doesn't know that (and given the existence of syscall(3),
can't do much about it even if it knew), so some of the arguments end
up
in the wrong registers.
Several other architectures (ARM, PowerPC, s390, ...) have similar constraints. Because the offset is the fourth argument, it gets placed in the r3 and r4 32-bit registers, but some architectures need it in either r2/r3 or r4/r5. This led some to advocate reordering the parameters, putting the offset before iovcnt to avoid the problem. As long as that change doesn't bubble out to user space, Hoffmann is amenable to making the change: "I'd *really* hate it to have the same system call with different argument ordering on different systems though". Most seemed to agree that the user-space interface as presented by glibc should match what the BSDs provide. It causes too many headaches for folks trying to write standards or portable code otherwise. To fix the alignment problem, the system call itself has the reordered version of the arguments. That led to Hoffmann's third version of the patchset, which still didn't solve the whole problem. There are multiple architectures that have both 32 and 64-bit versions and the 64-bit kernel must support system calls from 32-bit user-space programs. Those programs will put 64-bit arguments into two registers, but the 64-bit kernel will expect that argument in a single register. Because of this, Arnd Bergmann recommended splitting the offset into two arguments, one for the high 32 bits and one for the low: "This is the only way I can see that lets us use a shared compat_sys_preadv/pwritev across all 64 bit architectures". When a 32-bit user-space program makes a system call on a 64-bit system, the compat_sys_* version is used to handle differences in the data sizes. If a pointer to a structure is passed to a system call, and that structure has a different representation in 32-bits than it does in 64-bits, the compat layer makes the translation. Because different 64-bit architectures do things differently in terms of calling conventions and alignment requirements, the only way to share compat code is to remove the 64-bit quantity from the system call interface entirely. That just leaves one final problem to overcome: endian-ness. As Ralf Baechle notes, MIPS can be either little or big-endian, so the compat_sys_preadv/pwritev() needs to put the two 32-bit offset values together in the proper way. He recommended moving the MIPS-specific merge_64() macro into a common compat.h include file, which could then be used by the common compat routines. So far, version 4 of the patchset has not emerged, but one suspects that the offset argument splitting and use of merge_64() will be part of it. The implementation of the operation of preadv() and pwritev() is very obvious, certainly in comparison to the intricacies of passing its arguments. The VFS implementations of readv()/writev() already take an offset argument, so it was simply a matter of calling those. It is interesting to note that as part of the review, Christoph Hellwig spotted a bug in the existing compat_sys_readv/writev() implementations which would lead to accounting information not being updated for those calls. This is not the first time these system calls have been proposed; way back in 2005, we looked at some patches from Badari Pulavarty that added them. Other than a brief appearance in the -mm tree, they seem to have faded away. Even if this edition of preadv() and pwritev() do not make it into the mainline—so far there are no indications that they won't—the code review surrounding it was certainly useful. Getting a glimpse of the complexities around 64-bit quantities being passed to system calls was quite informative as well. |
