Rick Matthews wrote:
> Are there instances where an assigned zero-copy buffer could be orphaned?

No. The consumer must release the buffers through VOP_RETZCBUF().
Mahesh

> If so, should there be a recovery list associated with this addition?
> Perhaps off the designated vnode.
>
> This comment shouldn't block fast-track approval. Just a question.
> --
> Rick
>
> On 09/ 9/09 04:02 PM, Rich.Brown at Sun.COM wrote:
>> I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli
>> Zhang.
>> This case proposes new interfaces to support copy reduction in the
>> I/O path, especially for file sharing services.
>>
>> Minor binding is requested.
>>
>> This times out on Wednesday, 16 September, 2009.
>>
>>
>> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
>> This information is Copyright 2009 Sun Microsystems
>> 1. Introduction
>>     1.1. Project/Component Working Name:
>>          Copy Reduction Interfaces
>>     1.2. Name of Document Author/Supplier:
>>          Author: Mahesh Siddheshwar, Chunli Zhang
>>     1.3  Date of This Document:
>>          09 September, 2009
>> 4. Technical Description
>>
>> == Introduction/Background ==
>>
>> Zero-copy (copy avoidance) is essentially buffer sharing among
>> multiple modules that pass data between one another. This
>> proposal avoids the data copy in the READ/WRITE path of filesystems
>> by providing a mechanism to share data buffers between the modules.
>> It is intended to be used by network file sharing services such as
>> NFS, CIFS or others.
>>
>> Although buffer sharing can be achieved through a few different
>> solutions, any such solution must work with File Event Monitors
>> (FEM monitors)[1] installed on the files. The solution must also
>> allow the underlying filesystem to maintain any existing file range
>> locking in the filesystem.
>>
>> The proposed solution provides extensions to the existing VOP
>> interface to request and return buffers from a filesystem. The
>> buffers are then used with existing VOP_READ/VOP_WRITE calls with
>> minimal changes.
>>
>>
>> == Proposed Changes ==
>>
>> VOP Extensions for Zero-Copy Support
>> ========================================
>>
>> a. Extended struct uio, xuio_t
>>
>>   The following proposes an extensible uio structure that can be
>>   extended for multiple purposes. For example, an immediate
>>   extension, xu_zc, is to be used by the proposed
>>   VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned zero-copy
>>   buffers, as well as to be passed to the existing
>>   VOP_READ/VOP_WRITE calls for normal read/write operations.
>>   Another example of extension, xu_aio, is intended to replace
>>   uioa_t for async I/O.
>>
>>   This new structure, xuio_t, contains the following:
>>
>>   - the existing uio structure (embedded) as the first member
>>   - additional fields to support extensibility
>>   - a union of all the defined extensions
>>
>>   The following uio_extflag is added to indicate that a uio
>>   structure is indeed an xuio_t:
>>
>>     #define UIO_XUIO        0x004   /* Structure is xuio_t */
>>
>>   The following uio_extflag will be removed after uioa_t has been
>>   converted to xuio_t:
>>
>>     #define UIO_ASYNC       0x002   /* uio_t is really a uioa_t */
>>
>>   The project team has commitment from the networking team to remove
>>   the current use of uioa_t and use the proposed extensions (CR
>>   6880095).
>>
>>   The definition of xuio_t is:
>>
>>     typedef struct xuio {
>>         uio_t xu_uio;           /* Embedded UIO structure */
>>
>>         /* Extended uio fields */
>>         enum xuio_type xu_type; /* What kind of uio structure? */
>>
>>         union {
>>
>>             /* Async I/O Support */
>>             struct {
>>                 uint32_t xu_a_state;      /* state of async i/o */
>>                 ssize_t xu_a_mbytes;      /* bytes that have been
>>                                              uioamove()ed */
>>                 uioa_page_t *xu_a_lcur;   /* pointer into uioa_locked[] */
>>                 void **xu_a_lppp;         /* pointer into lcur->uioa_ppp[] */
>>                 void *xu_a_hwst[4];       /* opaque hardware state */
>>                 uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* Per iov
>>                                              locked pages */
>>             } xu_aio;
>>
>>             /* Zero Copy Support */
>>             struct {
>>                 enum uio_rw xu_zc_rw;     /* the use of the buffer */
>>                 void *xu_zc_priv;         /* fs specific */
>>             } xu_zc;
>>
>>         } xu_ext;
>>     } xuio_t;
>>
>>   where xu_type is currently defined as:
>>
>>     typedef enum xuio_type {
>>         UIOTYPE_ASYNCIO,
>>         UIOTYPE_ZEROCOPY
>>     } xuio_type_t;
>>
>>   New uio extensions can be added by defining a new xuio_type_t
>>   value and adding a new member to the xu_ext union.
>>
>> b. Requesting zero-copy buffers
>>
>>     #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
>>         fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
>>
>>     int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
>>         caller_context_t *);
>>
>>   This function requests buffers associated with file vp in
>>   preparation for a subsequent zero-copy read or write. The extended
>>   uio_t, xuio_t, is used to pass the parameters and results. Only
>>   the following fields of xuio_t are relevant to this call.
>>
>>   uiozcp->xu_uio.uio_resid: Used by the caller to specify the total
>>       length of the buffer.
>>
>>   uiozcp->xu_uio.uio_loffset: Used by the caller to indicate the
>>       file offset it would like the buffers to be associated with.
>>       A value of -1 indicates that the provider returns buffers that
>>       are not associated with a particular offset. These are defined
>>       to be anonymous buffers. Anonymous buffers may be used for
>>       requesting a write buffer to receive data over the wire, where
>>       the file offset might not be readily available.
>>
>>   uiozcp->xu_uio.uio_iov: Used by the provider to return an array
>>       of buffers (in case multiple filesystem buffers have to be
>>       reserved for the requested length).
>>
>>   uiozcp->xu_uio.uio_iovcnt: Used by the provider to indicate the
>>       number of returned buffers (length of the array
>>       uiozcp->xu_uio.uio_iov).
>>
>>   Other arguments to the call include:
>>
>>   vp: vnode pointer of the associated file.
>>
>>   rwflag: Indicates what the buffers are to be subsequently used for.
>>       Expected values are UIO_READ for VOP_READ() and UIO_WRITE for
>>       VOP_WRITE().
>>
>>   Upon successful completion, the function returns 0. One or more
>>   buffers may be returned as referenced by the uio_iov[] and
>>   uio_iovcnt members. uiozcp->xu_uio.uio_extflag is set to UIO_XUIO,
>>   and uiozcp->xu_type is set to UIOTYPE_ZEROCOPY.
>>
>>   The caller can use this returned xuio_t in a subsequent call to
>>   VOP_READ or VOP_WRITE. In the case of UIO_READ buffers, the caller
>>   should reference the uio_iov[] buffers only after a successful
>>   VOP_READ(). In the case of UIO_WRITE buffers, the caller should
>>   not reference the uio_iov[] buffers after a successful VOP_WRITE().
>>
>>   In the case of anonymous buffers, the caller should set the value
>>   of uio_loffset before such a read/write call. This should be done
>>   only in the case of anonymous buffers.
>>
>>   The member xu_zc_priv of the extended uio structure for zero-copy
>>   is a private handle that may be used by the provider to track its
>>   buffer headers or any other private information that is useful to
>>   map the loaned iovec entries to its internal buffers. The
>>   xu_zc_priv member is private to the provider and should not be
>>   changed or interpreted in any way by the callers.
>>
>>   Upon failure, the function returns EINVAL and the content of
>>   uiozcp should be ignored by the callers. The provider must fail
>>   the request if it is unable to satisfy the complete request (i.e.
>>   it must not return buffers that cover only a part of the length
>>   that was asked for).
>>
>>   Probable causes for failure include:
>>
>>   - the filesystem is short on buffers to loan out at the time
>>   - the filesystem determines that it's not efficient to take the
>>     zero-copy path based on the input parameters
>>
>> c. Returning zero-copy buffers
>>
>>     #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
>>         fop_retzcbuf(vp, uiozcp, cr, ct)
>>
>>     int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);
>>
>>   This function returns the buffers previously obtained via a call
>>   to VOP_REQZCBUF(). In case multiple buffers are associated with the
>>   uio_iov[], all the buffers associated with the uiozcp are returned.
>>   In other words, VOP_RETZCBUF() should only be called once per
>>   xuio_t. The caller should not reference any of the uio_iov[]
>>   members after a return.
>>
>> d. New VFS feature attributes
>>
>>   A new VFS feature attribute is introduced for the support of the
>>   zero-copy interface.
>>
>>     #define VFSFT_ZEROCOPY_SUPPORTED 0x100000100
>>
>>   Zero-copy is an optional feature. A filesystem supporting the
>>   zero-copy interface (i.e. the Interface Provider) must set this
>>   VFS feature attribute through the VFS Feature Registration
>>   interface[2]. Callers of the interface (i.e. the Interface
>>   Consumer) must check the presence of support through the
>>   vfs_has_feature() interface.
>>
>>   The intermediate fop routines (called via the VOP_* macros) will
>>   detect if the interfaces are being called for a filesystem that
>>   does not support zero-copy and will return ENOTSUP.
>>
>> INTERFACE TABLE
>>
>> +=======================================================================+
>>                          |Proposed       |Specified |
>>                          |Stability      |in what   |
>> Interface Name           |Classification |Document? | Comments
>> +=======================================================================+
>> VOP_REQZCBUF()           |Consolidation  |This      | New VOP calls
>> fop_reqzcbuf()           |Private        |Document  |
>> VOP_RETZCBUF()           |               |          |
>> fop_retzcbuf()           |               |          |
>>                          |               |          |
>> VFSFT_ZEROCOPY_SUPPORTED |               |          | New VFS feature
>>                          |               |          | definition
>>                          |               |          |
>> xuio_t                   |               |          | Extended uio_t
>>                          |               |          | definition
>>                          |               |          |
>> uioa_t                   |               |          | Deprecated
>> UIO_ASYNC                |               |          | Deprecated
>> +=======================================================================+
>>
>> * The project's deliverables will all go into the OS/NET
>>   Consolidation, so no contracts are required.
>>
>>
>> == Using the New VOP Interfaces for Zero-copy ==
>>
>> VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction
>> with VOP_READ() or VOP_WRITE() to implement zero-copy read or write.
>>
>> a. Read
>>
>>   In a normal read, the consumer allocates the data buffer and
>>   passes it to VOP_READ(). The provider initiates the I/O, and
>>   copies the data from its own cache buffer to the consumer supplied
>>   buffer.
>>
>>   To avoid the copy (initiating a zero-copy read), the consumer
>>   first calls VOP_REQZCBUF() to inform the provider to prepare to
>>   loan out its cache buffer. It then calls VOP_READ(). After the
>>   call returns, the consumer has direct access to the cache buffer
>>   loaned out by the provider. After processing the data, the
>>   consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
>>   the provider.
>>
>>   Here is an illustration using NFSv4 read over TCP:
>>
>>     rfs4_op_read(nfs_argop4 *argop, ...)
>>     {
>>         int zerocopy;
>>         xuio_t *xuio;
>>         ...
>>         xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>>         /* set up length, offset, etc. */
>>         if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
>>             zerocopy = 0;
>>             /* allocate the data buffer the normal way */
>>             /* initialize (uio_t *)xuio */
>>         } else {
>>             /* xuio has been set up by the provider */
>>             zerocopy = 1;
>>         }
>>         do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
>>         ...
>>         if (zerocopy) {
>>             /*
>>              * set up a callback mechanism that makes the network
>>              * layer call VOP_RETZCBUF() and free xuio after the
>>              * data is sent out
>>              */
>>         } else {
>>             kmem_free(xuio, sizeof(xuio_t));
>>         }
>>     }
>>
>> b. Write
>>
>>   In a normal write, the consumer allocates the data buffer, loads
>>   the data, and passes the buffer to VOP_WRITE(). The provider
>>   copies the data from the consumer supplied buffer to its own cache
>>   buffer, and starts the I/O.
>>
>>   To initiate a zero-copy write, the consumer first calls
>>   VOP_REQZCBUF() to grab a cache buffer from the provider. It loads
>>   the data directly into the loaned cache buffer, and calls
>>   VOP_WRITE(). After the call returns, the consumer calls
>>   VOP_RETZCBUF() to return the loaned cache buffer to the provider.
>>
>>   Here is an illustration using NFSv4 write via RDMA:
>>
>>     rfs4_op_write(nfs_argop4 *argop, ...)
>>     {
>>         int zerocopy;
>>         xuio_t *xuio;
>>         ...
>>         xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>>         /* set up length, offset, etc. */
>>         if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
>>             zerocopy = 0;
>>             /* allocate the data buffer the normal way */
>>             /* initialize (uio_t *)xuio */
>>             xdrrdma_read_from_client(...);
>>         } else {
>>             /* xuio has been set up by the provider */
>>             zerocopy = 1;
>>             xdrrdma_zcopy_read_from_client(..., xuio);
>>         }
>>         do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
>>         ...
>>         if (zerocopy) {
>>             VOP_RETZCBUF(vp, xuio, cr, &ct);
>>         }
>>         kmem_free(xuio, sizeof(xuio_t));
>>     }
>>
>>
>> References:
>> [1] PSARC/2003/172 File Event Monitoring
>> [2] PSARC/2007/227 VFS Features
>>
>> 6. Resources and Schedule
>>     6.4. Steering Committee requested information
>>         6.4.1. Consolidation C-team Name:
>>                ON
>>     6.5. ARC review type: FastTrack
>>     6.6. ARC Exposure: open
>>
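
To make the request/use/return pairing concrete, here is a rough user-space sketch of the protocol described in the case. It is not kernel code and not the project's implementation: the model_* functions, provider_t, and the *_model types are hypothetical stand-ins invented for illustration, and malloc() stands in for a filesystem's cache buffers. It only shows the rules stated in the proposal: a request is all-or-nothing, success tags the structure with UIO_XUIO and UIOTYPE_ZEROCOPY, an unsupported provider yields ENOTSUP, and every successful request is matched by exactly one return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel types in the proposal. */
enum uio_rw { UIO_READ, UIO_WRITE };
typedef enum xuio_type { UIOTYPE_ASYNCIO, UIOTYPE_ZEROCOPY } xuio_type_t;
#define UIO_XUIO 0x004

struct iovec_model { void *iov_base; size_t iov_len; };

typedef struct uio_model {
    struct iovec_model *uio_iov;
    int uio_iovcnt;
    long long uio_loffset;
    size_t uio_resid;
    int uio_extflag;
} uio_model_t;

typedef struct xuio_model {
    uio_model_t xu_uio;            /* embedded uio, first member */
    xuio_type_t xu_type;
    union {
        struct { enum uio_rw xu_zc_rw; void *xu_zc_priv; } xu_zc;
    } xu_ext;
} xuio_model_t;

/* Toy provider: tracks outstanding loans so orphans would be visible. */
typedef struct provider {
    int supports_zc;               /* models VFSFT_ZEROCOPY_SUPPORTED */
    int loans_outstanding;
    size_t max_loan;               /* largest request it will satisfy */
} provider_t;

/* Models VOP_REQZCBUF(): all-or-nothing loan of one buffer. */
int model_reqzcbuf(provider_t *p, enum uio_rw rw, xuio_model_t *x)
{
    size_t len = x->xu_uio.uio_resid;
    struct iovec_model *iov;

    if (!p->supports_zc)
        return ENOTSUP;
    if (len == 0 || len > p->max_loan)
        return EINVAL;             /* never satisfy only part of a request */
    iov = malloc(sizeof(*iov));
    if (iov == NULL)
        return EINVAL;             /* short on buffers */
    iov->iov_base = malloc(len);
    if (iov->iov_base == NULL) {
        free(iov);
        return EINVAL;
    }
    iov->iov_len = len;
    x->xu_uio.uio_iov = iov;
    x->xu_uio.uio_iovcnt = 1;
    x->xu_uio.uio_extflag = UIO_XUIO;
    x->xu_type = UIOTYPE_ZEROCOPY;
    x->xu_ext.xu_zc.xu_zc_rw = rw;
    x->xu_ext.xu_zc.xu_zc_priv = iov;  /* provider-private handle */
    p->loans_outstanding++;
    return 0;
}

/* Models VOP_RETZCBUF(): called exactly once per successful request. */
int model_retzcbuf(provider_t *p, xuio_model_t *x)
{
    struct iovec_model *iov;

    if (!p->supports_zc)
        return ENOTSUP;
    assert(x->xu_uio.uio_extflag & UIO_XUIO);
    assert(x->xu_type == UIOTYPE_ZEROCOPY);
    iov = x->xu_ext.xu_zc.xu_zc_priv;
    free(iov->iov_base);
    free(iov);
    x->xu_uio.uio_iov = NULL;      /* caller must not touch these again */
    x->xu_uio.uio_iovcnt = 0;
    p->loans_outstanding--;
    return 0;
}

/* Consumer write path: returns 1 for zero-copy, 0 for the copy fallback. */
int model_consumer_write(provider_t *p, const char *data, size_t len)
{
    xuio_model_t x;

    memset(&x, 0, sizeof(x));
    x.xu_uio.uio_resid = len;
    x.xu_uio.uio_loffset = -1;     /* ask for an anonymous buffer */
    if (model_reqzcbuf(p, UIO_WRITE, &x) != 0)
        return 0;                  /* a real caller would copy instead */
    memcpy(x.xu_uio.uio_iov[0].iov_base, data, len);
    x.xu_uio.uio_loffset = 0;      /* anonymous: set offset before write */
    /* VOP_WRITE() would be issued here. */
    (void) model_retzcbuf(p, &x);
    return 1;
}
```

A caller might exercise it with a provider initialized as { 1, 0, 4096 }: a 5-byte write takes the zero-copy path and leaves loans_outstanding at 0, while an 8192-byte request fails with EINVAL without loaning anything. The loans_outstanding counter is only a device for this sketch to show that nothing is orphaned when the consumer follows the contract.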