Roch wrote:
> Filesystems might have some blocksize and alignment constraints
> conditioning their ability to loan up buffers (for writes). 
> If that is so, we could use an API to query the FS about
> those values. For a copy on write & variable block size
> filesystem, that natural blocksize might also depend on the
> vnode being targeted.
Yes. The provider can fail the VOP_REQZCBUF() call if it determines
that taking the zero-copy path would be inefficient. Depending on the
provider implementation, this may require the request to be blocksize
aligned. In such cases, the consumer can use the VFSNAME_STATVFS() call
to obtain the 'f_bsize' value.  But as you note, certain implementations
may have different values for individual files. In that case, if
VOP_REQZCBUF() fails, the consumer falls back to the traditional
non-zero-copy path.
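
As a rough illustration of the kind of check a provider might apply
before loaning buffers, here is a minimal userland sketch. The helper
name zc_request_aligned() and the idea of passing the block size in
explicitly are assumptions made for this example; neither is part of
the proposal, and a real provider may use different criteria.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical provider-side check (not part of the proposal): accept a
 * zero-copy request only if it starts on a block boundary and covers
 * whole blocks. Returns 0 if the request qualifies and EINVAL otherwise,
 * EINVAL being the error VOP_REQZCBUF() is described as returning.
 */
static int
zc_request_aligned(uint64_t offset, uint64_t len, uint64_t blksz)
{
    assert(blksz != 0);
    if (len == 0 || offset % blksz != 0 || len % blksz != 0)
        return (EINVAL);
    return (0);
}
```

A consumer that gets EINVAL back would simply continue with an
ordinary, copying VOP_READ()/VOP_WRITE().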

An additional API to find such constraints/requirements may
be useful in the future, but is out of scope for this project.  However,
the project team will open an RFE for this issue and put you on the
interest list.
> Do we know if ZFS will ever be able to
> loan up buffers for writes that are not aligned full records ?
>   
No, that is not currently planned. The request has to be block size
aligned. Also note that currently, from an implementation perspective,
zero-copy WRITEs are efficient only in the case of network-based
filesystems like NFS over RDMA transports.

Mahesh
> -r
>
> Rich.Brown at Sun.COM writes:
>  > I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
>  > This case proposes new interfaces to support copy reduction in the I/O path
>  > especially for file sharing services.
>  > 
>  > Minor binding is requested.
>  > 
>  > This times out on Wednesday, 16 September, 2009.
>  > 
>  > 
>  > Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
>  > This information is Copyright 2009 Sun Microsystems
>  > 1. Introduction
>  >     1.1. Project/Component Working Name:
>  >     Copy Reduction Interfaces
>  >     1.2. Name of Document Author/Supplier:
>  >     Author:  Mahesh Siddheshwar, Chunli Zhang
>  >     1.3  Date of This Document:
>  >    09 September, 2009
>  > 4. Technical Description
>  > 
>  >  == Introduction/Background ==
>  > 
>  >  Zero-copy (copy avoidance) is essentially buffer sharing
>  >  among multiple modules that pass data between the modules. 
>  >  This proposal avoids the data copy in the READ/WRITE path 
>  >  of filesystems, by providing a mechanism to share data buffers
>  >  between the modules. It is intended to be used by network file
>  >  sharing services like NFS, CIFS or others.
>  > 
>  >  Although the buffer sharing can be achieved through a few different
>  >  solutions, any such solution must work with File Event Monitors
>  >  (FEM monitors)[1] installed on the files. The solution must
>  >  allow the underlying filesystem to maintain any existing file 
>  >  range locking in the filesystem.
>  >  
>  >  The proposed solution provides extensions to the existing VOP
>  >  interface to request and return buffers from a filesystem. The 
>  >  buffers are then used with existing VOP_READ/VOP_WRITE calls with
>  >  minimal changes.
>  > 
>  > 
>  >  == Proposed Changes ==
>  > 
>  >  VOP Extensions for Zero-Copy Support
>  >  ========================================
>  > 
>  >  a. Extended struct uio, xuio_t
>  > 
>  >   The following proposes an extensible uio structure that can be extended for
>  >   multiple purposes.  For example, an immediate extension, xu_zc, is to be
>  >   used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned
>  >   zero-copy buffers, as well as to be passed to the existing VOP_READ/VOP_WRITE
>  >   calls for normal read/write operations.  Another example of extension,
>  >   xu_aio, is intended to replace uioa_t for async I/O.
>  > 
>  >   This new structure, xuio_t, contains the following:
>  > 
>  >   - the existing uio structure (embedded) as the first member
>  >   - additional fields to support extensibility
>  >   - a union of all the defined extensions
>  > 
>  >   The following uio_extflag is added to indicate that a uio structure is
>  >   in fact an xuio_t:
>  > 
>  >   #define  UIO_XUIO        0x004   /* Structure is xuio_t */
>  > 
>  >   The following uio_extflag will be removed after uioa_t has been converted
>  >   to xuio_t:
>  >
>  >   #define  UIO_ASYNC       0x002   /* Structure is uioa_t */
>  > 
>  >   The project team has commitment from the networking team to remove
>  >   the current use of uioa_t and use the proposed extensions (CR 6880095).
>  > 
>  >   The definition of xuio_t is:
>  > 
>  >   typedef struct xuio {
>  >     uio_t xu_uio;          /* Embedded UIO structure */
>  > 
>  >     /* Extended uio fields */
>  >     enum xuio_type xu_type;        /* What kind of uio structure? */
>  > 
>  >     union {
>  > 
>  >            /* Async I/O Support */
>  >            struct {
>  >             uint32_t xu_a_state;   /* state of async i/o */
>  >             ssize_t xu_a_mbytes;   /* bytes that have been uioamove()ed */
>  >             uioa_page_t *xu_a_lcur;        /* pointer into uioa_locked[] */
>  >             void **xu_a_lppp;              /* pointer into lcur->uioa_ppp[] */
>  >             void *xu_a_hwst[4];            /* opaque hardware state */
>  >             uioa_page_t xu_a_locked[UIOA_IOV_MAX];   /* Per iov locked pages */
>  >            } xu_aio;
>  > 
>  >            /* Zero Copy Support */
>  >            struct {
>  >             enum uio_rw xu_zc_rw;  /* the use of the buffer */
>  >             void *xu_zc_priv;              /* fs specific */
>  >            } xu_zc;
>  > 
>  >     } xu_ext;
>  >   } xuio_t;
>  > 
>  >   where xu_type is currently defined as:
>  > 
>  >   typedef enum xuio_type {
>  >     UIOTYPE_ASYNCIO,
>  >     UIOTYPE_ZEROCOPY
>  >   } xuio_type_t;
>  > 
>  >   New uio extensions can be added by defining a new xuio_type_t value and
>  >   adding a new member to the xu_ext union.
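
To illustrate why the embedded uio_t must be the first member, here is a
minimal userland sketch of how code handed a plain uio_t pointer might
recover the enclosing xuio_t. The types are cut-down stand-ins for the
kernel definitions quoted above, and the helper uio_to_zc_xuio() is a
hypothetical name invented for this example, not an interface from the
proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel types; the real uio_t has more fields. */
#define UIO_XUIO 0x004

typedef struct uio {
    int uio_extflag;            /* field name as used in the proposal text */
} uio_t;

typedef enum xuio_type { UIOTYPE_ASYNCIO, UIOTYPE_ZEROCOPY } xuio_type_t;

typedef struct xuio {
    uio_t xu_uio;               /* must remain the first member */
    xuio_type_t xu_type;
    union {
        struct { void *xu_zc_priv; } xu_zc;
    } xu_ext;
} xuio_t;

/*
 * Hypothetical helper: return the enclosing zero-copy xuio_t for a
 * uio_t pointer, or NULL if the uio is not flagged as an xuio_t or is
 * an extension of a different type.
 */
static xuio_t *
uio_to_zc_xuio(uio_t *uio)
{
    xuio_t *xuio;

    assert(offsetof(xuio_t, xu_uio) == 0);
    if ((uio->uio_extflag & UIO_XUIO) == 0)
        return (NULL);          /* plain uio_t: do not cast */
    xuio = (xuio_t *)uio;       /* valid because xu_uio is at offset 0 */
    return (xuio->xu_type == UIOTYPE_ZEROCOPY ? xuio : NULL);
}
```

The flag test comes before the cast, so a plain uio_t is never read as
if it were the larger structure.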
>  > 
>  >  b. Requesting zero-copy buffers
>  > 
>  >     #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
>  >     fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
>  > 
>  >     int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
>  >    caller_context_t *);
>  >  
>  >     This function requests buffers associated with file vp in preparation for a
>  >     subsequent zero-copy read or write. The extended uio_t, xuio_t, is used
>  >     to pass the parameters and results. Only the following fields of xuio_t are
>  >     relevant to this call.
>  >  
>  >     uiozcp->xu_uio.uio_resid: used by the caller to specify the total length
>  >          of the buffer.
>  >
>  >     uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file offset
>  >          it would like the buffers to be associated with. A value of -1
>  >          indicates that the provider returns buffers that are not associated
>  >          with a particular offset.  These are defined to be anonymous buffers.
>  >          Anonymous buffers may be used for requesting a write buffer to receive
>  >          data over the wire, where file offset might not be handily available.
>  > 
>  >     uiozcp->xu_uio.uio_iov: used by the provider to return an array of buffers
>  >          (in case multiple filesystem buffers have to be reserved for the
>  >          requested length).
>  >
>  >     uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate the number of
>  >          returned buffers (length of array uiozcp->xu_uio.uio_iov).
>  > 
>  >     Other arguments to the call include:
>  > 
>  >     vp:  vnode pointer of the associated file.
>  > 
>  >     rwflag: Indicates what the buffers are to be subsequently used for.
>  >             Expected values are UIO_READ for VOP_READ() and UIO_WRITE for
>  >             VOP_WRITE().
>  > 
>  >     Upon successful completion, the function returns 0. One or more
>  >     buffers may be returned as referenced by uio_iov[] and uio_iovcnt members.
>  >     uiozcp->xu_uio.uio_extflag is set to UIO_XUIO, and uiozcp->xu_type is set
>  >     to UIOTYPE_ZEROCOPY.
>  > 
>  >     The caller can use this returned xuio_t in a subsequent call to VOP_READ
>  >     or VOP_WRITE. In the case of UIO_READ buffers, the caller should
>  >     reference the uio_iov[] buffers only after a successful VOP_READ().
>  >     In the case of UIO_WRITE buffers, the caller should not reference
>  >     the uio_iov[] buffers after a successful VOP_WRITE.
>  > 
>  >     In the case of anonymous buffers, the caller should set the value of 
>  >     uio_loffset before such a read/write call. This should be done only in 
>  >     the case of anonymous buffers. 
>  > 
>  >     The member xu_zc_priv of the extended uio structure for zero-copy is 
>  >     a private handle that may be used by the provider to track its buffer
>  >     headers or any other private information that is useful to map the 
>  >     loaned iovec entries to its internal buffers. The xu_zc_priv member
>  >     is private to the provider and should not be changed or interpreted
>  >     in any way by the callers.
>  > 
>  >     Upon failure, the function returns EINVAL error and the content
>  >     of uiozcp should be ignored by the callers. The provider must fail the
>  >     request if it is unable to satisfy the complete request (ie. it must
>  >     not return buffers that cover only a part of the length that was
>  >     asked for).
>  > 
>  >     Probable causes for failure include:
>  > 
>  >     - the filesystem is short on buffers to loan out at the time
>  >     - the filesystem determines that it's not efficient to take the
>  >       zero-copy path based on the input parameters
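
The all-or-nothing rule above can be sketched in userland as follows.
This is only a mock under stated assumptions: the fixed pool, the names
mock_reqzcbuf(), zc_iov_t, POOL_BLOCKS and BLOCK_SIZE, and the
whole-block requirement are all invented for illustration and say
nothing about how a real filesystem manages its buffers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define POOL_BLOCKS 4           /* hypothetical number of loanable buffers */
#define BLOCK_SIZE  4096        /* hypothetical buffer size */

typedef struct zc_iov {         /* stand-in for iovec_t */
    void *iov_base;
    size_t iov_len;
} zc_iov_t;

static char pool[POOL_BLOCKS][BLOCK_SIZE];
static int pool_busy[POOL_BLOCKS];

/*
 * Mock provider: loan enough free blocks to cover len bytes. Either the
 * whole request is satisfied (returns 0 and sets *cntp) or nothing is
 * loaned and EINVAL is returned; a partial loan is never made.
 */
static int
mock_reqzcbuf(size_t len, zc_iov_t *iov, int maxiov, int *cntp)
{
    size_t need, nfree = 0;
    int i, n = 0;

    assert(iov != NULL && cntp != NULL && maxiov >= 0);
    if (len == 0 || len % BLOCK_SIZE != 0)
        return (EINVAL);        /* not worth the zero-copy path */
    need = len / BLOCK_SIZE;
    for (i = 0; i < POOL_BLOCKS; i++)
        nfree += !pool_busy[i];
    if (need > nfree || need > (size_t)maxiov)
        return (EINVAL);        /* short on buffers: loan nothing */
    for (i = 0; i < POOL_BLOCKS && (size_t)n < need; i++) {
        if (pool_busy[i])
            continue;
        pool_busy[i] = 1;
        iov[n].iov_base = pool[i];
        iov[n].iov_len = BLOCK_SIZE;
        n++;
    }
    *cntp = n;
    return (0);
}
```

Counting the free blocks before marking any of them busy is what keeps
a failed request from leaving buffers half-reserved.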
>  >     
>  >  c. Returning zero-copy buffers
>  > 
>  >     #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
>  >     fop_retzcbuf(vp, uiozcp, cr, ct)
>  > 
>  >     int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);
>  >  
>  >     This function returns the buffers previously obtained via a call
>  >     to VOP_REQZCBUF(). In case multiple buffers are associated with the
>  >     uio_iov[], all the buffers associated with the uiozcp are returned.
>  >     In other words, VOP_RETZCBUF() should only be called once per xuio_t.
>  >     The caller should not reference any of the uio_iov[] members after
>  >     a return.
>  > 
>  >  d. New VFS feature attributes
>  > 
>  >     A new VFS feature attribute is introduced for the support of
>  >     zero-copy interface.
>  > 
>  >   #define VFSFT_ZEROCOPY_SUPPORTED     0x100000100
>  > 
>  >    Zero-copy is an optional feature. A filesystem supporting the
>  >    zero-copy interface (ie. the Interface Provider) must set this
>  >    VFS feature attribute through the VFS Feature Registration
>  >    interface[2]. Callers of the interface (ie. Interface Consumer)
>  >    must check the presence of support through vfs_has_feature() interface.
>  >    The intermediate fop routines (called via the VOP_* macros) will detect
>  >    if the interfaces are being called for a filesystem that does not support
>  >    zero-copy and will return ENOTSUP.
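
A rough userland sketch of that gating behaviour is shown below. All
names here (mock_vfs_t, mock_vnode_t, mock_fop_reqzcbuf(),
mock_loan_ok()) are hypothetical, and the feature is modelled as a
simple boolean rather than through the real VFS feature registration
interface, so treat it as an illustration of the control flow only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-ins for vfs_t/vnode_t; the feature is modelled as one boolean. */
typedef struct mock_vfs {
    int zc_supported;           /* set if the filesystem registered zero-copy */
} mock_vfs_t;

typedef struct mock_vnode {
    mock_vfs_t *v_vfsp;
    int (*v_reqzcbuf)(struct mock_vnode *);     /* provider entry point */
} mock_vnode_t;

/* Trivial provider entry point that always agrees to loan buffers. */
static int
mock_loan_ok(mock_vnode_t *vp)
{
    (void) vp;
    return (0);
}

/*
 * Mock of the intermediate routine: refuse with ENOTSUP when the
 * filesystem has not advertised zero-copy support, otherwise hand the
 * request to the provider.
 */
static int
mock_fop_reqzcbuf(mock_vnode_t *vp)
{
    assert(vp != NULL && vp->v_vfsp != NULL);
    if (!vp->v_vfsp->zc_supported || vp->v_reqzcbuf == NULL)
        return (ENOTSUP);
    return (vp->v_reqzcbuf(vp));
}
```

Either error, ENOTSUP from the wrapper or EINVAL from the provider,
leads the consumer to the same place: the ordinary copying path.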
>  > 
>  >  INTERFACE TABLE
>  >
>  >  +=========================================================================+
>  >                             |Proposed       |Specified   |
>  >                             |Stability      |in what     |
>  >   Interface Name            |Classification |Document?   | Comments
>  >  +=========================================================================+
>  >    VOP_REQZCBUF()           |Consolidation  |This        | New VOP calls
>  >    fop_reqzcbuf()           |Private        |Document    |
>  >    VOP_RETZCBUF()           |               |            |
>  >    fop_retzcbuf()           |               |            |
>  >                             |               |            |
>  >    VFSFT_ZEROCOPY_SUPPORTED |               |            | New VFS feature definition
>  >                             |               |            |
>  >    xuio_t                   |               |            | Extended uio_t definition
>  >                             |               |            |
>  >    uioa_t                   |               |            | Deprecated
>  >    UIO_ASYNC                |               |            | Deprecated
>  >  +=========================================================================+
>  > 
>  >  * The project's deliverables will all go into the OS/NET
>  >    Consolidation, so no contracts are required.
>  > 
>  > 
>  >  == Using the New VOP Interfaces for Zero-copy ==
>  > 
>  >  VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with
>  >  VOP_READ() or VOP_WRITE() to implement zero-copy read or write. 
>  > 
>  >  a. Read
>  > 
>  >     In a normal read, the consumer allocates the data buffer and passes it to
>  >     VOP_READ().  The provider initiates the I/O, and copies the data from its
>  >     own cache buffer to the consumer supplied buffer.
>  >
>  >     To avoid the copy (initiating a zero-copy read), the consumer first calls
>  >     VOP_REQZCBUF() to inform the provider to prepare to loan out its cache
>  >     buffer.  It then calls VOP_READ().  After the call returns, the consumer
>  >     has direct access to the cache buffer loaned out by the provider.  After
>  >     processing the data, the consumer calls VOP_RETZCBUF() to return the loaned
>  >     cache buffer to the provider.
>  > 
>  >     Here is an illustration using NFSv4 read over TCP:
>  > 
>  >         rfs4_op_read(nfs_argop4 *argop, ...)
>  >         {
>  >             int zerocopy;
>  >             xuio_t *xuio;
>  >             ...
>  >             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>  >             /* set up length, offset, etc. */
>  >             if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
>  >                 zerocopy = 0;
>  >                 /* allocate the data buffer the normal way */
>  >                 /* initialize (uio_t *)xuio */
>  >             } else {
>  >                 /* xuio has been set up by the provider */
>  >                 zerocopy = 1;
>  >             }
>  >             do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
>  >             ...
>  >             if (zerocopy) {
>  >                 /*
>  >                  * set up a callback mechanism that makes the network
>  >                  * layer call VOP_RETZCBUF() and free xuio after the
>  >                  * data is sent out
>  >                  */
>  >             } else {
>  >                 kmem_free(xuio, sizeof(xuio_t));
>  >             }
>  >         }
>  > 
>  >  b. Write
>  > 
>  >     In a normal write, the consumer allocates the data buffer, loads the data,
>  >     and passes the buffer to VOP_WRITE().  The provider copies the data from
>  >     the consumer supplied buffer to its own cache buffer, and starts the I/O.
>  >
>  >     To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to
>  >     grab a cache buffer from the provider.  It loads the data directly to
>  >     the loaned cache buffer, and calls VOP_WRITE().  After the call returns,
>  >     the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
>  >     the provider.
>  > 
>  >     Here is an illustration using NFSv4 write via RDMA:
>  > 
>  >         rfs4_op_write(nfs_argop4 *argop, ...)
>  >         {
>  >             int zerocopy;
>  >             xuio_t *xuio;
>  >             ...
>  >             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>  >             /* set up length, offset, etc. */
>  >             if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
>  >                 zerocopy = 0;
>  >                 /* allocate the data buffer the normal way */
>  >                 /* initialize (uio_t *)xuio */
>  >                 xdrrdma_read_from_client(...);
>  >             } else {
>  >                 /* xuio has been set up by the provider */
>  >                 zerocopy = 1;
>  >                 xdrrdma_zcopy_read_from_client(..., xuio);
>  >             }
>  >             do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
>  >             ...
>  >             if (zerocopy) {
>  >                 VOP_RETZCBUF(vp, xuio, cr, &ct);
>  >             }
>  >             kmem_free(xuio, sizeof(xuio_t));
>  >         }
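
The request / fill / write / return sequence, including the fallback,
can be exercised with a small userland mock. Everything below is an
assumption made for illustration: the single-buffer "provider", the
names mock_reqzcbuf(), mock_write(), mock_retzcbuf() and
consumer_write(), and the copy counter are not part of the proposal and
do not model a real filesystem or transport.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define BLKSZ 8                 /* hypothetical provider block size */

static char fs_cache[BLKSZ];    /* the provider's cache buffer */
static int fs_loaned;           /* nonzero while the buffer is on loan */
static int fs_copies;           /* copies the provider made on write */

/* Loan the cache buffer only for a full-block request. */
static int
mock_reqzcbuf(size_t len, char **bufp)
{
    if (len != BLKSZ || fs_loaned)
        return (EINVAL);
    fs_loaned = 1;
    *bufp = fs_cache;
    return (0);
}

/* A write from a loaned buffer needs no copy; anything else is copied. */
static void
mock_write(const char *buf, size_t len)
{
    if (buf != fs_cache) {
        memcpy(fs_cache, buf, len);
        fs_copies++;
    }
}

static void
mock_retzcbuf(void)
{
    fs_loaned = 0;
}

/* Consumer: try zero-copy first, fall back to a private buffer. */
static int
consumer_write(const char *wire, size_t len)
{
    char bounce[BLKSZ];
    char *buf = bounce;
    int zerocopy;

    assert(len <= BLKSZ);
    zerocopy = (mock_reqzcbuf(len, &buf) == 0);
    memcpy(buf, wire, len);     /* stands in for receiving from the wire */
    mock_write(buf, len);
    if (zerocopy)
        mock_retzcbuf();
    return (zerocopy);
}
```

With a full-block payload the data lands in fs_cache with no provider
copy; a short payload takes the fallback and costs one copy.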
>  > 
>  > 
>  >  References:
>  >   [1] PSARC/2003/172 File Event Monitoring 
>  >   [2] PSARC/2007/227 VFS Features 
>  > 
>  > 
>  > 6. Resources and Schedule
>  >     6.4. Steering Committee requested information
>  >            6.4.1. Consolidation C-team Name:
>  >            ON
>  >     6.5. ARC review type: FastTrack
>  >     6.6. ARC Exposure: open
>  > 
>
>   
