I am sponsoring this fast-track case for myself.
This proposal will time out on 07/11/2008.
All of the proposed interfaces are specified by Posix (SUSv3),
so the stability level is 'Standard' in all cases.
The proposed release binding is "minor release"
since no one wants this to be back-ported to Solaris 10.
The change being proposed here is to implement the interfaces,
specified by the Posix SUSv3 standard, necessary to support
the _POSIX_ADVISORY_INFO option group in Solaris.
This includes implementing the libc functions:
    posix_fadvise()
    posix_fallocate()
    posix_madvise()
    posix_memalign()
and activating the [f]pathconf() variables:
 _____________________________________________________
| {POSIX_ALLOC_SIZE_MIN}     | _PC_ALLOC_SIZE_MIN     |
|____________________________|________________________|
| {POSIX_REC_INCR_XFER_SIZE} | _PC_REC_INCR_XFER_SIZE |
|____________________________|________________________|
| {POSIX_REC_MAX_XFER_SIZE}  | _PC_REC_MAX_XFER_SIZE  |
|____________________________|________________________|
| {POSIX_REC_MIN_XFER_SIZE}  | _PC_REC_MIN_XFER_SIZE  |
|____________________________|________________________|
| {POSIX_REC_XFER_ALIGN}     | _PC_REC_XFER_ALIGN     |
|____________________________|________________________|
These [f]pathconf() interfaces already exist, but they all
return -1 and set errno to EINVAL.
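As an illustration of how an application might consume these variables once they are activated, here is a minimal sketch. The helper name query_path_limit() is hypothetical and not part of this proposal. It relies only on the standard pathconf() convention that a return of -1 with errno unchanged means "no limit defined", while -1 with errno set means failure.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the proposal): query one
 * pathconf() variable and separate the three possible outcomes.
 *
 * Returns  1 and stores the value in *value if the variable has one,
 *          0 if pathconf() returned -1 without setting errno
 *            (no limit is defined for this file),
 *         -1 if pathconf() failed; errno holds the cause.
 */
static int
query_path_limit(const char *path, int name, long *value)
{
	long v;

	assert(path != NULL && value != NULL);

	errno = 0;
	v = pathconf(path, name);
	if (v != -1) {
		*value = v;
		return (1);
	}
	return (errno == 0 ? 0 : -1);
}
```

A caller would pass, for example, _PC_REC_XFER_ALIGN and use the stored value to align its I/O buffers, falling back to a default of its own choosing when the helper returns 0 or -1.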
The posix_fallocate() interface was implemented in solaris_nevada
for the UFS file system as part of the PSARC case:
PSARC 2004/422 posix_fallocate
Unfortunately, it was not implemented with proper Posix-specified
error return values, so it has to be fixed up a bit.
Also, unfortunately, the interface was implemented only for the
UFS file system. Further work is needed to implement this interface
for other file systems, notably ZFS and NFS. The NFS file system
needs to be taught to send the new request over the wire, as it
already does for ftruncate(). These are future projects.
There is no requirement to implement posix_fallocate() for all file
systems. The posix_fallocate() specification allows for this error:
EINVAL The underlying file system does not support this operation.
The posix_fadvise(), posix_madvise(), and posix_memalign() interfaces
do not yet exist.
The initial implementation of posix_fadvise() will do nothing
other than return proper error values. This is OK because the
SUSv3 specification doesn't require it to do anything. This
just provides the infrastructure for some future project to
use to optimize I/O performance.
The posix_madvise() function will just call the existing madvise()
function. The posix_memalign() function will just call the existing
memalign() function.
These header files require additions:
    <fcntl.h>      add declarations for posix_fadvise() and posix_fadvise64()
    <stdlib.h>     add declaration for posix_memalign()
    <unistd.h>     add definition of _POSIX_ADVISORY_INFO (200112L)
    <sys/fcntl.h>  add definitions of advice values for posix_fadvise()
    <sys/mman.h>   add declaration for posix_madvise()
                   add definitions of advice values for posix_madvise()
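With _POSIX_ADVISORY_INFO defined in <unistd.h>, portable code can test for the option group before using these interfaces. The sketch below shows one way to do so; advisory_info_version() is a hypothetical name, and the fallback to sysconf(_SC_ADVISORY_INFO) assumes the platform provides that sysconf() name.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the revision of the Advisory
 * Information option group (for example 200112L), or -1 if the
 * option is not supported.
 */
static long
advisory_info_version(void)
{
	long v;

#if defined(_POSIX_ADVISORY_INFO) && (_POSIX_ADVISORY_INFO - 0) > 0
	v = _POSIX_ADVISORY_INFO;		/* always supported */
#elif defined(_SC_ADVISORY_INFO)
	v = sysconf(_SC_ADVISORY_INFO);		/* ask at run time */
#else
	v = -1L;
#endif

	/* A supported option reports a standard revision date. */
	assert(v <= 0 || v >= 200112L);
	return (v > 0 ? v : -1L);
}
```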
The values returned for the [f]pathconf() variables are both system-
and filesystem-dependent. The proposed values are best described by
these comments from usr/src/uts/common/syscall/pathconf.c:
    case _PC_ALLOC_SIZE_MIN:
    case _PC_REC_INCR_XFER_SIZE:
    case _PC_REC_MAX_XFER_SIZE:
    case _PC_REC_MIN_XFER_SIZE:
    case _PC_REC_XFER_ALIGN:
        /*
         * There is generally no harm in doing larger transfers, but
         * there's a point of diminishing returns.  With 1MB transfers,
         * even if they're random, you get very close to platter speed.
         * So we return 1MB as the maximum transfer size.
         */
        if (cmd == _PC_REC_MAX_XFER_SIZE)
            return ((long)MAX(sb.f_bsize, 1UL << 20));
        /*
         * By definition, f_frsize is the smallest filesystem block.
         * However, _PC_ALLOC_SIZE_MIN is intended to define the
         * threshold for direct I/O.  This implies two requirements:
         * the VM and I/O subsystems must be able to create mappings
         * for DMA, which requires at least page alignment; and the
         * filesystem must avoid read/modify/write, which generally
         * requires multiples of its 'preferred' blocksize, f_bsize.
         *
         * PAGESIZE alignment is sufficient for DMA and block copy.
         * Rounding up to the filesystem 'preferred' blocksize
         * works just as well.
         *
         * All together, this means that the remaining parameters
         * map into the same value.
         */
        return ((long)MAX(sb.f_bsize, PAGESIZE));
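The arithmetic in the fragment above can be restated as a small stand-alone function to see which values result. This is only a user-level sketch: advisory_value() is a hypothetical name, and the block size and page size are passed in as plain numbers rather than read from sb.f_bsize and PAGESIZE.

```c
#include <assert.h>

/*
 * Hypothetical restatement of the rule quoted above:
 *   maximum transfer size  = MAX(bsize, 1MB)
 *   every other variable   = MAX(bsize, pagesize)
 */
static long
advisory_value(unsigned long bsize, unsigned long pagesize, int want_max)
{
	unsigned long lower;

	assert(bsize > 0 && pagesize > 0);

	lower = want_max ? (1UL << 20) : pagesize;
	return ((long)(bsize > lower ? bsize : lower));
}
```

For instance, with an 8192-byte preferred block size and a 4096-byte page (illustrative numbers, not measurements), the maximum transfer size comes out as 1048576 and the other four variables as 8192. With a 512-byte block size the other four would instead be 4096, the page size.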
See the materials directory for the manual pages:
    posix_fadvise.3c
    posix_fallocate.3c
    posix_madvise.3c
    posix_memalign.3c
These are copies of the Posix SUSv3 specification pages, with
minor changes such as changing 'shall' to 'will' and specifying
what the 'implementation-defined' behaviors are for Solaris.
For reference, the Posix SUSv3 specification pages are included as:
    posix_fadvise.susv3
    posix_fallocate.susv3
    posix_madvise.susv3
    posix_memalign.susv3
Roger Faulkner