Roch wrote:

> Filesystems might have some blocksize and alignment constraints
> conditioning their ability to loan up buffers (for writes).
> If that is so, we could use an API to query the FS about
> those values. For a copy on write & variable block size
> filesystem, that natural blocksize might also depend on the
> vnode being targeted.

Yes. The provider can fail the VOP_REQZCBUF() call if it determines that
it is inefficient to take the zero-copy path. Depending on the provider
implementation, this could mean the request must be blocksize aligned.
In such cases, the consumer could use the VFS_STATVFS() call to determine
the 'f_bsize' value. But as you note, certain implementations may have
different values for individual files. In such cases, if VOP_REQZCBUF()
fails, the consumer falls back to the traditional non-zero-copy path.
An additional API to find such constraints/requirements may be useful in
the future, but is out of scope for this project. However, the project
team will open an RFE for this issue and put you on the interest list.

> Do we know if ZFS will ever be able to
> loan up buffers for writes that are not aligned full records ?

No, not planned currently. It has to be block size aligned. Also note
that currently, from an implementation perspective, zero-copy WRITEs are
efficient only in the case of network-based filesystems like NFS over
RDMA transports.

Mahesh

> -r
>
> Rich.Brown at Sun.COM writes:
> > I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
> > This case proposes new interfaces to support copy reduction in the I/O path
> > especially for file sharing services.
> >
> > Minor binding is requested.
> >
> > This times out on Wednesday, 16 September, 2009.
> >
> >
> > Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
> > This information is Copyright 2009 Sun Microsystems
> > 1. Introduction
> >     1.1. Project/Component Working Name:
> >          Copy Reduction Interfaces
> >     1.2. Name of Document Author/Supplier:
> >          Author: Mahesh Siddheshwar, Chunli Zhang
> >     1.3  Date of This Document:
> >          09 September, 2009
> > 4. Technical Description
> >
> > == Introduction/Background ==
> >
> > Zero-copy (copy avoidance) is essentially buffer sharing
> > among multiple modules that pass data between the modules.
> > This proposal avoids the data copy in the READ/WRITE path
> > of filesystems by providing a mechanism to share data buffers
> > between the modules. It is intended to be used by network file
> > sharing services like NFS, CIFS, or others.
> >
> > Although the buffer sharing can be achieved through a few different
> > solutions, any such solution must work with File Event Monitors
> > (FEM monitors)[1] installed on the files. The solution must
> > allow the underlying filesystem to maintain any existing file
> > range locking in the filesystem.
> >
> > The proposed solution provides extensions to the existing VOP
> > interface to request and return buffers from a filesystem. The
> > buffers are then used with the existing VOP_READ/VOP_WRITE calls
> > with minimal changes.
> >
> >
> > == Proposed Changes ==
> >
> > VOP Extensions for Zero-Copy Support
> > ========================================
> >
> > a. Extended struct uio, xuio_t
> >
> > The following proposes an extensible uio structure that can be
> > extended for multiple purposes. For example, an immediate extension,
> > xu_zc, is to be used by the proposed VOP_REQZCBUF/VOP_RETZCBUF
> > interfaces to pass loaned zero-copy buffers, as well as to be passed
> > to the existing VOP_READ/VOP_WRITE calls for normal read/write
> > operations. Another example of an extension, xu_aio, is intended to
> > replace uioa_t for async I/O.
> >
> > This new structure, xuio_t, contains the following:
> >
> > - the existing uio structure (embedded) as the first member
> > - additional fields to support extensibility
> > - a union of all the defined extensions
> >
> > The following uio_extflag is added to indicate that a uio structure
> > is indeed an xuio_t:
> >
> >     #define UIO_XUIO    0x004   /* Structure is xuio_t */
> >
> > The following uio_extflag will be removed after uioa_t has been
> > converted to xuio_t:
> >
> >     #define UIO_ASYNC   0x002   /* Structure is uioa_t */
> >
> > The project team has commitment from the networking team to remove
> > the current use of uioa_t and use the proposed extensions (CR 6880095).
> >
> > The definition of xuio_t is:
> >
> >     typedef struct xuio {
> >         uio_t xu_uio;               /* Embedded UIO structure */
> >
> >         /* Extended uio fields */
> >         enum xuio_type xu_type;     /* What kind of uio structure? */
> >
> >         union {
> >
> >             /* Async I/O Support */
> >             struct {
> >                 uint32_t xu_a_state;    /* state of async i/o */
> >                 ssize_t xu_a_mbytes;    /* bytes uioamove()ed */
> >                 uioa_page_t *xu_a_lcur; /* pointer into uioa_locked[] */
> >                 void **xu_a_lppp;       /* pointer into lcur->uioa_ppp[] */
> >                 void *xu_a_hwst[4];     /* opaque hardware state */
> >                 uioa_page_t xu_a_locked[UIOA_IOV_MAX];
> >                                         /* Per iov locked pages */
> >             } xu_aio;
> >
> >             /* Zero Copy Support */
> >             struct {
> >                 enum uio_rw xu_zc_rw;   /* the use of the buffer */
> >                 void *xu_zc_priv;       /* fs specific */
> >             } xu_zc;
> >
> >         } xu_ext;
> >     } xuio_t;
> >
> > where xu_type is currently defined as:
> >
> >     typedef enum xuio_type {
> >         UIOTYPE_ASYNCIO,
> >         UIOTYPE_ZEROCOPY
> >     } xuio_type_t;
> >
> > New uio extensions can be added by defining a new xuio_type_t, and
> > adding a new member to the xu_ext union.
> >
> > b. Requesting zero-copy buffers
> >
> >     #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
> >             fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
> >
> >     int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
> >         caller_context_t *);
> >
> > This function requests buffers associated with file vp in preparation
> > for a subsequent zero-copy read or write. The extended uio_t --
> > xuio_t -- is used to pass the parameters and results. Only the
> > following fields of xuio_t are relevant to this call.
> >
> > uiozcp->xu_uio.uio_resid: used by the caller to specify the total
> >     length of the buffer.
> >
> > uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file
> >     offset it would like the buffers to be associated with. A value
> >     of -1 indicates that the provider returns buffers that are not
> >     associated with a particular offset. These are defined to be
> >     anonymous buffers.
> >     Anonymous buffers may be used for requesting a write buffer to
> >     receive data over the wire, where the file offset might not be
> >     handily available.
> >
> > uiozcp->xu_uio.uio_iov: used by the provider to return an array of
> >     buffers (in case multiple filesystem buffers have to be reserved
> >     for the requested length).
> >
> > uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate the
> >     number of returned buffers (the length of the uio_iov array).
> >
> > Other arguments to the call include:
> >
> > vp: vnode pointer of the associated file.
> >
> > rwflag: indicates what the buffers are to be subsequently used for.
> >     Expected values are UIO_READ for VOP_READ() and UIO_WRITE for
> >     VOP_WRITE().
> >
> > Upon successful completion, the function returns 0. One or more
> > buffers may be returned as referenced by the uio_iov[] and uio_iovcnt
> > members. uiozcp->xu_uio.uio_extflag is set to UIO_XUIO, and
> > uiozcp->xu_type is set to UIOTYPE_ZEROCOPY.
> >
> > The caller can use this returned xuio_t in a subsequent call to
> > VOP_READ or VOP_WRITE. In the case of UIO_READ buffers, the caller
> > should reference the uio_iov[] buffers only after a successful
> > VOP_READ(). In the case of UIO_WRITE buffers, the caller should not
> > reference the uio_iov[] buffers after a successful VOP_WRITE().
> >
> > In the case of anonymous buffers, the caller should set the value of
> > uio_loffset before such a read/write call. This should be done only
> > in the case of anonymous buffers.
> >
> > The member xu_zc_priv of the extended uio structure for zero-copy is
> > a private handle that may be used by the provider to track its buffer
> > headers or any other private information that is useful to map the
> > loaned iovec entries to its internal buffers. The xu_zc_priv member
> > is private to the provider and should not be changed or interpreted
> > in any way by the callers.
> >
> > Upon failure, the function returns an EINVAL error and the content
> > of uiozcp should be ignored by the callers. The provider must fail
> > the request if it is unable to satisfy the complete request (i.e. it
> > must not return buffers that cover only a part of the length that
> > was asked for).
> >
> > Probable causes for failure include:
> >
> > - the filesystem is short on buffers to loan out at the time
> > - the filesystem determines that it's not efficient to take the
> >   zero-copy path based on the input parameters
> >
> > c. Returning zero-copy buffers
> >
> >     #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
> >             fop_retzcbuf(vp, uiozcp, cr, ct)
> >
> >     int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);
> >
> > This function returns the buffers previously obtained via a call to
> > VOP_REQZCBUF(). In case multiple buffers are associated with the
> > uio_iov[], all the buffers associated with the uiozcp are returned.
> > In other words, VOP_RETZCBUF() should only be called once per xuio_t.
> > The caller should not reference any of the uio_iov[] members after
> > it returns.
> >
> > d. New VFS feature attributes
> >
> > A new VFS feature attribute is introduced for the support of the
> > zero-copy interface.
> >
> >     #define VFSFT_ZEROCOPY_SUPPORTED 0x100000100
> >
> > Zero-copy is an optional feature. A filesystem supporting the
> > zero-copy interface (i.e. the Interface Provider) must set this
> > VFS feature attribute through the VFS Feature Registration
> > interface[2]. Callers of the interface (i.e. the Interface Consumer)
> > must check for the presence of support through the vfs_has_feature()
> > interface. The intermediate fop routines (called via the VOP_*
> > macros) will detect if the interfaces are being called for a
> > filesystem that does not support zero-copy and will return ENOTSUP.
> >
> > INTERFACE TABLE
> >
> > +==========================================================================+
> > |                           |Proposed       |Specified |                   |
> > |                           |Stability      |in what   |                   |
> > | Interface Name            |Classification |Document? | Comments          |
> > +==========================================================================+
> > | VOP_REQZCBUF()            |Consolidation  |This      | New VOP calls     |
> > | fop_reqzcbuf()            |Private        |Document  |                   |
> > | VOP_RETZCBUF()            |               |          |                   |
> > | fop_retzcbuf()            |               |          |                   |
> > |                           |               |          |                   |
> > | VFSFT_ZEROCOPY_SUPPORTED  |               |          | New VFS feature   |
> > |                           |               |          | definition        |
> > |                           |               |          |                   |
> > | xuio_t                    |               |          | Extended uio_t    |
> > |                           |               |          | definition        |
> > |                           |               |          |                   |
> > | uioa_t                    |               |          | Deprecated        |
> > | UIO_ASYNC                 |               |          | Deprecated        |
> > +==========================================================================+
> >
> > * The project's deliverables will all go into the OS/NET
> >   Consolidation, so no contracts are required.
> >
> >
> > == Using the New VOP Interfaces for Zero-Copy ==
> >
> > VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction
> > with VOP_READ() or VOP_WRITE() to implement zero-copy read or write.
> >
> > a. Read
> >
> > In a normal read, the consumer allocates the data buffer and passes
> > it to VOP_READ(). The provider initiates the I/O, and copies the
> > data from its own cache buffer to the consumer-supplied buffer.
> >
> > To avoid the copy (initiating a zero-copy read), the consumer first
> > calls VOP_REQZCBUF() to inform the provider to prepare to loan out
> > its cache buffer. It then calls VOP_READ(). After the call returns,
> > the consumer has direct access to the cache buffer loaned out by
> > the provider. After processing the data, the consumer calls
> > VOP_RETZCBUF() to return the loaned cache buffer to the provider.
> >
> > Here is an illustration using NFSv4 read over TCP:
> >
> >     rfs4_op_read(nfs_argop4 *argop, ...)
> >     {
> >         int zerocopy;
> >         xuio_t *xuio;
> >         ...
> >         xuio = kmem_alloc(sizeof (xuio_t), KM_SLEEP);
> >         setup length, offset, etc;
> >         if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, ct)) {
> >             zerocopy = 0;
> >             allocate the data buffer the normal way;
> >             initialize (uio_t *)xuio;
> >         } else {
> >             /* xuio has been set up by the provider */
> >             zerocopy = 1;
> >         }
> >         do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
> >         ...
> >         if (zerocopy) {
> >             setup callback mechanism that makes the network layer call
> >             VOP_RETZCBUF() and free xuio after the data is sent out;
> >         } else {
> >             kmem_free(xuio, sizeof (xuio_t));
> >         }
> >     }
> >
> > b. Write
> >
> > In a normal write, the consumer allocates the data buffer, loads the
> > data, and passes the buffer to VOP_WRITE(). The provider copies the
> > data from the consumer-supplied buffer to its own cache buffer, and
> > starts the I/O.
> >
> > To initiate a zero-copy write, the consumer first calls
> > VOP_REQZCBUF() to grab a cache buffer from the provider. It loads
> > the data directly into the loaned cache buffer, and calls
> > VOP_WRITE(). After the call returns, the consumer calls
> > VOP_RETZCBUF() to return the loaned cache buffer to the provider.
> >
> > Here is an illustration using NFSv4 write via RDMA:
> >
> >     rfs4_op_write(nfs_argop4 *argop, ...)
> >     {
> >         int zerocopy;
> >         xuio_t *xuio;
> >         ...
> >         xuio = kmem_alloc(sizeof (xuio_t), KM_SLEEP);
> >         setup length, offset, etc;
> >         if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, ct)) {
> >             zerocopy = 0;
> >             allocate the data buffer the normal way;
> >             initialize (uio_t *)xuio;
> >             xdrrdma_read_from_client(...);
> >         } else {
> >             /* xuio has been set up by the provider */
> >             zerocopy = 1;
> >             xdrrdma_zcopy_read_from_client(..., xuio);
> >         }
> >         do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
> >         ...
> >         if (zerocopy) {
> >             VOP_RETZCBUF(vp, xuio, cr, &ct);
> >         }
> >         kmem_free(xuio, sizeof (xuio_t));
> >     }
> >
> >
> > References:
> > [1] PSARC/2003/172 File Event Monitoring
> > [2] PSARC/2007/227 VFS Features
> >
> >
> > 6. Resources and Schedule
> >     6.4. Steering Committee requested information
> >         6.4.1. Consolidation C-team Name:
> >             ON
> >     6.5. ARC review type: FastTrack
> >     6.6. ARC Exposure: open