I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang. This case proposes new interfaces to support copy reduction in the I/O path, especially for file sharing services.
Minor binding is requested.

This times out on Wednesday, 16 September, 2009.

Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         Copy Reduction Interfaces
    1.2. Name of Document Author/Supplier:
         Author:  Mahesh Siddheshwar, Chunli Zhang
    1.3  Date of This Document:
         09 September, 2009

4. Technical Description

== Introduction/Background ==

Zero-copy (copy avoidance) is essentially buffer sharing among
multiple modules that pass data between the modules. This proposal
avoids the data copy in the READ/WRITE path of filesystems, by
providing a mechanism to share data buffers between the modules. It
is intended to be used by network file sharing services like NFS,
CIFS or others.

Although the buffer sharing can be achieved through a few different
solutions, any such solution must work with File Event Monitors
(FEM monitors)[1] installed on the files. The solution must allow
the underlying filesystem to maintain any existing file range
locking in the filesystem.

The proposed solution provides extensions to the existing VOP
interface to request and return buffers from a filesystem. The
buffers are then used with existing VOP_READ/VOP_WRITE calls with
minimal changes.

== Proposed Changes ==

VOP Extensions for Zero-Copy Support
====================================

a. Extended struct uio, xuio_t

The following proposes an extensible uio structure that can be
extended for multiple purposes. For example, an immediate
extension, xu_zc, is to be used by the proposed
VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned zero-copy
buffers, as well as to be passed to the existing VOP_READ/VOP_WRITE
calls for normal read/write operations. Another example of
extension, xu_aio, is intended to replace uioa_t for async I/O.

This new structure, xuio_t, contains the following:

  - the existing uio structure (embedded) as the first member
  - additional fields to support extensibility
  - a union of all the defined extensions

The following uio_extflag is added to indicate that a uio structure
is indeed an xuio_t:

    #define UIO_XUIO    0x004   /* Structure is xuio_t */

The following uio_extflag will be removed after uioa_t has been
converted to xuio_t:

    #define UIO_ASYNC   0x002   /* Structure is uioa_t */

The project team has a commitment from the networking team to remove
the current use of uioa_t and use the proposed extensions
(CR 6880095).

The definition of xuio_t is:

    typedef struct xuio {
        uio_t xu_uio;               /* Embedded UIO structure */

        /* Extended uio fields */
        enum xuio_type xu_type;     /* What kind of uio structure? */

        union {
            /* Async I/O Support */
            struct {
                uint32_t xu_a_state;    /* state of async i/o */
                ssize_t xu_a_mbytes;    /* bytes that have been uioamove()ed */
                uioa_page_t *xu_a_lcur; /* pointer into uioa_locked[] */
                void **xu_a_lppp;       /* pointer into lcur->uioa_ppp[] */
                void *xu_a_hwst[4];     /* opaque hardware state */
                uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* Per iov locked pages */
            } xu_aio;

            /* Zero Copy Support */
            struct {
                enum uio_rw xu_zc_rw;   /* the use of the buffer */
                void *xu_zc_priv;       /* fs specific */
            } xu_zc;
        } xu_ext;
    } xuio_t;

where xu_type is currently defined as:

    typedef enum xuio_type {
        UIOTYPE_ASYNCIO,
        UIOTYPE_ZEROCOPY
    } xuio_type_t;

New uio extensions can be added by defining a new xuio_type_t value,
and adding a new member to the xu_ext union.

b. Requesting zero-copy buffers

    #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
            fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

    int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
            caller_context_t *);

This function requests buffers associated with file vp in
preparation for a subsequent zero-copy read or write. The extended
uio_t, xuio_t, is used to pass the parameters and results.
Only the following fields of xuio_t are relevant to this call.

    uiozcp->xu_uio.uio_resid: used by the caller to specify the
        total length of the buffer.

    uiozcp->xu_uio.uio_loffset: used by the caller to indicate the
        file offset it would like the buffers to be associated
        with. A value of -1 requests buffers that are not
        associated with a particular offset. These are defined to
        be anonymous buffers. Anonymous buffers may be used for
        requesting a write buffer to receive data over the wire,
        where the file offset might not be readily available.

    uiozcp->xu_uio.uio_iov: used by the provider to return an
        array of buffers (in case multiple filesystem buffers have
        to be reserved for the requested length).

    uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate
        the number of returned buffers (length of the array
        uiozcp->xu_uio.uio_iov).

Other arguments to the call include:

    vp: vnode pointer of the associated file.

    rwflag: indicates what the buffers are to be subsequently used
        for. Expected values are UIO_READ for VOP_READ() and
        UIO_WRITE for VOP_WRITE().

Upon successful completion, the function returns 0. One or more
buffers may be returned as referenced by the uio_iov[] and
uio_iovcnt members. uiozcp->xu_uio.uio_extflag is set to UIO_XUIO,
and uiozcp->xu_type is set to UIOTYPE_ZEROCOPY.

The caller can use this returned xuio_t in a subsequent call to
VOP_READ or VOP_WRITE. In the case of UIO_READ buffers, the caller
should reference the uio_iov[] buffers only after a successful
VOP_READ(). In the case of UIO_WRITE buffers, the caller should not
reference the uio_iov[] buffers after a successful VOP_WRITE().

In the case of anonymous buffers, the caller should set the value
of uio_loffset before such a read/write call. This should be done
only in the case of anonymous buffers.
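
The request contract above can be modelled in userland to make the field usage concrete. The following is a minimal sketch, not kernel code: every mock_* and MOCK_* name is a hypothetical stand-in invented for illustration, the 4096-byte block size and the "avail" parameter are assumptions, and only the fields that VOP_REQZCBUF() is described as using are modelled. It shows a provider that loans the whole requested length or fails with EINVAL, fills in uio_iov/uio_iovcnt, and marks the structure as a zero-copy xuio_t.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userland stand-ins; only the fields used here are modelled. */
#define MOCK_UIO_XUIO 0x004 /* mirrors UIO_XUIO from the proposal */
#define MOCK_BLKSZ    4096  /* assumed provider block size */

typedef enum { MOCK_UIO_READ, MOCK_UIO_WRITE } mock_uio_rw_t;
typedef enum { MOCK_UIOTYPE_ASYNCIO, MOCK_UIOTYPE_ZEROCOPY } mock_xuio_type_t;

typedef struct { void *iov_base; size_t iov_len; } mock_iovec_t;

typedef struct {
	mock_iovec_t *uio_iov;    /* returned by the provider */
	int uio_iovcnt;           /* returned by the provider */
	long long uio_loffset;    /* set by the caller; -1 means anonymous */
	unsigned int uio_extflag; /* provider sets MOCK_UIO_XUIO */
	long uio_resid;           /* set by the caller: total length wanted */
} mock_uio_t;

typedef struct {
	mock_uio_t xu_uio;        /* embedded first, as in xuio_t */
	mock_xuio_type_t xu_type;
	struct { mock_uio_rw_t xu_zc_rw; void *xu_zc_priv; } xu_zc;
} mock_xuio_t;

/* Toy provider: loan the complete length or fail; never a partial loan. */
int
mock_reqzcbuf(mock_xuio_t *x, mock_uio_rw_t rw, size_t avail)
{
	long len = x->xu_uio.uio_resid;

	if (len <= 0 || (size_t)len > avail)
		return (EINVAL);

	size_t n = ((size_t)len + MOCK_BLKSZ - 1) / MOCK_BLKSZ;
	mock_iovec_t *iov = calloc(n, sizeof (*iov));
	if (iov == NULL)
		return (EINVAL);

	size_t left = (size_t)len;
	for (size_t i = 0; i < n; i++) {
		size_t chunk = left < MOCK_BLKSZ ? left : MOCK_BLKSZ;
		iov[i].iov_base = malloc(chunk);
		if (iov[i].iov_base == NULL) {
			while (i-- > 0)
				free(iov[i].iov_base);
			free(iov);
			return (EINVAL);
		}
		iov[i].iov_len = chunk;
		left -= chunk;
	}

	x->xu_uio.uio_iov = iov;
	x->xu_uio.uio_iovcnt = (int)n;
	x->xu_uio.uio_extflag |= MOCK_UIO_XUIO;
	x->xu_type = MOCK_UIOTYPE_ZEROCOPY;
	x->xu_zc.xu_zc_rw = rw;
	x->xu_zc.xu_zc_priv = iov; /* provider-private handle; callers ignore it */
	return (0);
}

/* Toy provider: take back every buffer loaned for this request. */
void
mock_retzcbuf(mock_xuio_t *x)
{
	mock_iovec_t *iov = x->xu_zc.xu_zc_priv;

	for (int i = 0; i < x->xu_uio.uio_iovcnt; i++)
		free(iov[i].iov_base);
	free(iov);
	x->xu_uio.uio_iov = NULL;
	x->xu_uio.uio_iovcnt = 0;
	x->xu_zc.xu_zc_priv = NULL;
}

/* Caller-side helper: total bytes covered by the loaned buffers. */
size_t
mock_loaned_bytes(const mock_xuio_t *x)
{
	size_t total = 0;

	for (int i = 0; i < x->xu_uio.uio_iovcnt; i++)
		total += x->xu_uio.uio_iov[i].iov_len;
	return (total);
}
```

A caller of this model would set uio_resid and uio_loffset (-1 for an anonymous write buffer), call mock_reqzcbuf(), fill in the real offset before the write in the anonymous case, and finish with a single mock_retzcbuf(). In this model a 10000-byte request comes back as several block-sized buffers whose lengths sum to exactly 10000.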

The member xu_zc_priv of the extended uio structure for zero-copy
is a private handle that may be used by the provider to track its
buffer headers or any other private information that is useful to
map the loaned iovec entries to its internal buffers. The
xu_zc_priv member is private to the provider and should not be
changed or interpreted in any way by the callers.

Upon failure, the function returns EINVAL, and callers should
ignore the content of uiozcp. The provider must fail the request if
it is unable to satisfy the complete request (i.e. it must not
return buffers that cover only a part of the length that was asked
for).

Probable causes for failure include:

  - the filesystem is short on buffers to loan out at the time
  - the filesystem determines that it's not efficient to take the
    zero-copy path based on the input parameters

c. Returning zero-copy buffers

    #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
            fop_retzcbuf(vp, uiozcp, cr, ct)

    int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);

This function returns the buffers previously obtained via a call to
VOP_REQZCBUF(). In case multiple buffers are associated with the
uio_iov[], all the buffers associated with the uiozcp are returned.
In other words, VOP_RETZCBUF() should only be called once per
xuio_t. The caller should not reference any of the uio_iov[]
members after a return.

d. New VFS feature attributes

A new VFS feature attribute is introduced for the support of the
zero-copy interface.

    #define VFSFT_ZEROCOPY_SUPPORTED    0x100000100

Zero-copy is an optional feature. A filesystem supporting the
zero-copy interface (i.e. the Interface Provider) must set this VFS
feature attribute through the VFS Feature Registration
interface[2]. Callers of the interface (i.e. the Interface
Consumer) must check for the presence of support through the
vfs_has_feature() interface.
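
The consumer-side decision this implies can be sketched as a small userland model. This is an illustration under stated assumptions, not the kernel implementation: mock_fs_t, mockfs_has_zc_feature(), mockfs_reqzcbuf(), mockfs_retzcbuf() and consumer_io() are hypothetical names, the feature bit is reduced to a boolean, and the error returned for a return without an outstanding loan is an assumption. The sketch shows the consumer checking the feature first, falling back to the normal copy path when the feature is absent or the request fails, and returning a successful loan exactly once.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a filesystem as seen by a zero-copy consumer. */
typedef struct {
	bool zc_feature;  /* stands in for VFSFT_ZEROCOPY_SUPPORTED being set */
	bool can_loan;    /* whether a complete request can be satisfied now */
	int outstanding;  /* loans handed out and not yet returned */
} mock_fs_t;

typedef enum { PATH_COPY, PATH_ZEROCOPY } io_path_t;

/* Stand-in for the consumer's vfs_has_feature() check. */
bool
mockfs_has_zc_feature(const mock_fs_t *fs)
{
	return (fs->zc_feature);
}

/* Stand-in for VOP_REQZCBUF(): ENOTSUP without the feature, else EINVAL or 0. */
int
mockfs_reqzcbuf(mock_fs_t *fs)
{
	if (!fs->zc_feature)
		return (ENOTSUP);
	if (!fs->can_loan)
		return (EINVAL);
	fs->outstanding++;
	return (0);
}

/* Stand-in for VOP_RETZCBUF(); EINVAL for an unmatched return is assumed. */
int
mockfs_retzcbuf(mock_fs_t *fs)
{
	if (!fs->zc_feature)
		return (ENOTSUP);
	if (fs->outstanding == 0)
		return (EINVAL);
	fs->outstanding--;
	return (0);
}

/* Consumer: prefer zero-copy, fall back to the normal copy path. */
io_path_t
consumer_io(mock_fs_t *fs)
{
	if (!mockfs_has_zc_feature(fs) || mockfs_reqzcbuf(fs) != 0) {
		/* allocate a buffer the normal way and do a copying read/write */
		return (PATH_COPY);
	}

	/* ... the VOP_READ() or VOP_WRITE() call would go here ... */

	(void) mockfs_retzcbuf(fs); /* exactly once per successful request */
	return (PATH_ZEROCOPY);
}
```

In this model a filesystem without the feature, and one that has the feature but is temporarily unable to loan buffers, both end up on the copy path; only a successful request takes the zero-copy path, and it leaves no loan outstanding afterwards.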

The intermediate fop routines (called via the VOP_* macros) will
detect if the interfaces are being called for a filesystem that
does not support zero-copy and will return ENOTSUP.

INTERFACE TABLE
+==========================================================================+
|                          |Proposed       |Specified |                    |
|                          |Stability      |in what   |                    |
|Interface Name            |Classification |Document? | Comments           |
+==========================================================================+
|VOP_REQZCBUF()            |Consolidation  |This      | New VOP calls      |
|fop_reqzcbuf()            |Private        |Document  |                    |
|VOP_RETZCBUF()            |               |          |                    |
|fop_retzcbuf()            |               |          |                    |
|                          |               |          |                    |
|VFSFT_ZEROCOPY_SUPPORTED  |               |          | New VFS feature    |
|                          |               |          | definition         |
|                          |               |          |                    |
|xuio_t                    |               |          | Extended uio_t     |
|                          |               |          | definition         |
|                          |               |          |                    |
|uioa_t                    |               |          | Deprecated         |
|UIO_ASYNC                 |               |          | Deprecated         |
+==========================================================================+

* The project's deliverables will all go into the OS/NET
  Consolidation, so no contracts are required.

== Using the New VOP Interfaces for Zero-copy ==

VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in
conjunction with VOP_READ() or VOP_WRITE() to implement zero-copy
read or write.

a. Read

In a normal read, the consumer allocates the data buffer and passes
it to VOP_READ(). The provider initiates the I/O, and copies the
data from its own cache buffer to the consumer supplied buffer.

To avoid the copy (initiating a zero-copy read), the consumer first
calls VOP_REQZCBUF() to inform the provider to prepare to loan out
its cache buffer. It then calls VOP_READ(). After the call returns,
the consumer has direct access to the cache buffer loaned out by
the provider. After processing the data, the consumer calls
VOP_RETZCBUF() to return the loaned cache buffer to the provider.

Here is an illustration using NFSv4 read over TCP:

    rfs4_op_read(nfs_argop4 *argop, ...)
    {
        int zerocopy;
        xuio_t *xuio;
        ...
        xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
        setup length, offset, etc;
        if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
            zerocopy = 0;
            allocate the data buffer the normal way;
            initialize (uio_t *)xuio;
        } else {
            /* xuio has been setup by the provider */
            zerocopy = 1;
        }
        do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
        ...
        if (zerocopy) {
            setup callback mechanism that makes the network layer
            call VOP_RETZCBUF() and free xuio after the data
            is sent out;
        } else {
            kmem_free(xuio, sizeof(xuio_t));
        }
    }

b. Write

In a normal write, the consumer allocates the data buffer, loads
the data, and passes the buffer to VOP_WRITE(). The provider copies
the data from the consumer supplied buffer to its own cache buffer,
and starts the I/O.

To initiate a zero-copy write, the consumer first calls
VOP_REQZCBUF() to grab a cache buffer from the provider. It loads
the data directly into the loaned cache buffer, and calls
VOP_WRITE(). After the call returns, the consumer calls
VOP_RETZCBUF() to return the loaned cache buffer to the provider.

Here is an illustration using NFSv4 write via RDMA:

    rfs4_op_write(nfs_argop4 *argop, ...)
    {
        int zerocopy;
        xuio_t *xuio;
        ...
        xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
        setup length, offset, etc;
        if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
            zerocopy = 0;
            allocate the data buffer the normal way;
            initialize (uio_t *)xuio;
            xdrrdma_read_from_client(...);
        } else {
            /* xuio has been setup by the provider */
            zerocopy = 1;
            xdrrdma_zcopy_read_from_client(..., xuio);
        }
        do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
        ...
        if (zerocopy) {
            VOP_RETZCBUF(vp, xuio, cr, &ct);
        }
        kmem_free(xuio, sizeof(xuio_t));
    }

References:

[1] PSARC/2003/172 File Event Monitoring
[2] PSARC/2007/227 VFS Features

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open