I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
This case proposes new interfaces to support copy reduction in the I/O path,
especially for file sharing services.

Minor binding is requested.

This times out on Wednesday, 16 September, 2009.

Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Copy Reduction Interfaces
    1.2. Name of Document Author/Supplier:
         Author:  Mahesh Siddheshwar, Chunli Zhang
    1.3. Date of This Document:
         09 September, 2009
4. Technical Description

 == Introduction/Background ==

 Zero-copy (copy avoidance) is essentially buffer sharing among
 modules that pass data to one another. This proposal avoids the
 data copy in the READ/WRITE path of filesystems by providing a
 mechanism to share data buffers between modules. It is intended
 to be used by network file sharing services such as NFS and CIFS.

 Although buffer sharing can be achieved in several ways, any
 solution must work with File Event Monitors (FEM monitors)[1]
 installed on the files, and must allow the underlying filesystem
 to maintain any existing file range locking.

 The proposed solution extends the existing VOP interface with calls
 to request buffers from, and return buffers to, a filesystem. The
 buffers are then used with the existing VOP_READ/VOP_WRITE calls
 with minimal changes.

 == Proposed Changes ==

 VOP Extensions for Zero-Copy Support

 a. Extended struct uio, xuio_t

  This case proposes an extensible uio structure, xuio_t, that can serve
  multiple purposes.  The immediate extension, xu_zc, is used by the
  proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned zero-copy
  buffers, and is also passed to the existing VOP_READ/VOP_WRITE calls
  for normal read/write operations.  Another extension, xu_aio, is
  intended to replace uioa_t for async I/O.

  This new structure, xuio_t, contains the following:

  - the existing uio structure (embedded) as the first member
  - additional fields to support extensibility
  - a union of all the defined extensions

  The following uio_extflag is added to indicate that a uio structure is
  in fact an xuio_t:

  #define       UIO_XUIO        0x004   /* Structure is xuio_t */

  The following uio_extflag will be removed after uioa_t has been converted 
  to xuio_t:

  #define       UIO_ASYNC       0x002   /* Structure is uioa_t */

  The project team has commitment from the networking team to remove
  the current use of uioa_t and use the proposed extensions (CR 6880095).

  The definition of xuio_t is:

  typedef struct xuio {
    uio_t xu_uio;               /* Embedded UIO structure */

    /* Extended uio fields */
    enum xuio_type xu_type;     /* What kind of uio structure? */

    union {

        /* Async I/O Support */
        struct {
            uint32_t xu_a_state;        /* state of async i/o */
            ssize_t xu_a_mbytes;        /* bytes that have been uioamove()ed */
            uioa_page_t *xu_a_lcur;     /* pointer into uioa_locked[] */
            void **xu_a_lppp;           /* pointer into lcur->uioa_ppp[] */
            void *xu_a_hwst[4];         /* opaque hardware state */
            uioa_page_t xu_a_locked[UIOA_IOV_MAX];   /* Per iov locked pages */
        } xu_aio;

        /* Zero Copy Support */
        struct {
            enum uio_rw xu_zc_rw;       /* the use of the buffer */
            void *xu_zc_priv;           /* fs specific */
        } xu_zc;

    } xu_ext;
  } xuio_t;

  where xu_type is currently defined as:

  typedef enum xuio_type {
  } xuio_type_t;

  New uio extensions can be added by defining a new xuio_type_t, and adding a
  new member to the xu_ext union.
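
  Because xu_uio is the first member, a pointer to the embedded uio
  structure can be converted back to the enclosing extended structure once
  UIO_XUIO has been checked. The user-space sketch below illustrates that
  idea only. The mock_* types and the MOCK_UIOTYPE_* enumerator names are
  hypothetical stand-ins for illustration and are not part of this proposal;
  only the UIO_XUIO value is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UIO_XUIO 0x004  /* flag value taken from the proposal */

/* Simplified stand-in for uio_t; only the field needed here. */
typedef struct mock_uio {
    uint16_t uio_extflag;
} mock_uio_t;

/* Hypothetical enumerators, for illustration only. */
typedef enum mock_xuio_type {
    MOCK_UIOTYPE_ASYNCIO,
    MOCK_UIOTYPE_ZEROCOPY
} mock_xuio_type_t;

/* Simplified stand-in for xuio_t. */
typedef struct mock_xuio {
    mock_uio_t xu_uio;          /* must remain the first member */
    mock_xuio_type_t xu_type;
    union {
        struct { void *xu_zc_priv; } xu_zc;
    } xu_ext;
} mock_xuio_t;

/*
 * Return the enclosing mock_xuio_t if uio is flagged as extended and
 * carries the zero-copy extension; otherwise return NULL.
 */
static mock_xuio_t *
mock_uio_to_zc(mock_uio_t *uio)
{
    mock_xuio_t *x;

    assert(uio != NULL);
    if ((uio->uio_extflag & UIO_XUIO) == 0)
        return (NULL);          /* a plain uio; do not cast */
    x = (mock_xuio_t *)uio;     /* valid: xu_uio is at offset 0 */
    if (x->xu_type != MOCK_UIOTYPE_ZEROCOPY)
        return (NULL);
    return (x);
}
```

  The flag check comes before the cast so that a plain uio structure is never
  read as if it carried the extension fields.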

 b. Requesting zero-copy buffers

    #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
    fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

    int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
        caller_context_t *);

    This function requests buffers associated with file vp in preparation for a
    subsequent zero-copy read or write. The extended uio structure, xuio_t, is
    used to pass the parameters and results. Only the following fields of
    xuio_t are relevant to this call:

    uiozcp->xu_uio.uio_resid: used by the caller to specify the total length
         of the buffer.

    uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file offset
         it would like the buffers to be associated with. A value of -1
         asks the provider to return buffers that are not associated with
         any particular offset.  These are defined to be anonymous buffers.
         Anonymous buffers may be used when requesting a write buffer to
         receive data over the wire, where the file offset might not be
         readily available.

    uiozcp->xu_uio.uio_iov: used by the provider to return an array of buffers
         (in case multiple filesystem buffers have to be reserved for the
         requested length).

    uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate the number of
         returned buffers (length of array uiozcp->xu_uio.uio_iov).

    Other arguments to the call include:

    vp:  vnode pointer of the associated file.

    rwflag: Indicates what the buffers are to be subsequently used for.
            Expected values are UIO_READ for VOP_READ() and UIO_WRITE for
            VOP_WRITE().

    Upon successful completion, the function returns 0. One or more
    buffers may be returned as referenced by uio_iov[] and uio_iovcnt members.
    uiozcp->xu_uio.uio_extflag is set to UIO_XUIO, and uiozcp->xu_uio is set

    The caller can use this returned xuio_t in a subsequent call to VOP_READ
    or VOP_WRITE. In the case of UIO_READ buffers, the caller should
    reference the uio_iov[] buffers only after a successful VOP_READ().
    In the case of UIO_WRITE buffers, the caller should not reference
    the uio_iov[] buffers after a successful VOP_WRITE.

    In the case of anonymous buffers, the caller should set the value of 
    uio_loffset before such a read/write call. This should be done only in 
    the case of anonymous buffers. 

    The member xu_zc_priv of the extended uio structure for zero-copy is 
    a private handle that may be used by the provider to track its buffer
    headers or any other private information that is useful to map the 
    loaned iovec entries to its internal buffers. The xu_zc_priv member
    is private to the provider and should not be changed or interpreted
    in any way by the callers.

    Upon failure, the function returns an EINVAL error and the content
    of uiozcp should be ignored by the callers. The provider must fail the
    request if it is unable to satisfy the complete request (i.e. it must
    not return buffers that cover only a part of the length that was
    asked for).

    Probable causes for failure include:

    - the filesystem is short on buffers to loan out at the time
    - the filesystem determines that it's not efficient to take the
      zero-copy path based on the input parameters
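
    The all-or-nothing rule for requests can be illustrated with a small
    user-space model. mock_reqzcbuf(), MOCK_BLKSZ, MOCK_NBLKS and the static
    pool are hypothetical stand-ins for a filesystem's buffer cache, and the
    caller supplies the iovec storage here as a simplification. Only the
    behaviour follows the text: fill the iovec array and count on success, and
    return EINVAL without loaning anything when the full length cannot be
    covered.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/uio.h>    /* struct iovec */

#define MOCK_BLKSZ 4096 /* hypothetical provider block size */
#define MOCK_NBLKS 4    /* hypothetical number of loanable blocks */

static char mock_pool[MOCK_NBLKS][MOCK_BLKSZ];
static int mock_loaned[MOCK_NBLKS];     /* 1 while a block is on loan */

/*
 * Loan enough blocks to cover resid bytes. On success, fill iov[]
 * (which must have room for MOCK_NBLKS entries) and *iovcnt, and
 * return 0. If the whole request cannot be covered, loan nothing
 * and return EINVAL.
 */
static int
mock_reqzcbuf(size_t resid, struct iovec *iov, int *iovcnt)
{
    size_t need = (resid + MOCK_BLKSZ - 1) / MOCK_BLKSZ;
    size_t avail = 0, left = resid;
    int i, n = 0;

    assert(iov != NULL && iovcnt != NULL);
    for (i = 0; i < MOCK_NBLKS; i++)
        avail += !mock_loaned[i];
    if (resid == 0 || need > avail)
        return (EINVAL);        /* never satisfy a request partially */

    for (i = 0; i < MOCK_NBLKS && left > 0; i++) {
        if (mock_loaned[i])
            continue;
        mock_loaned[i] = 1;
        iov[n].iov_base = mock_pool[i];
        iov[n].iov_len = left < MOCK_BLKSZ ? left : MOCK_BLKSZ;
        left -= iov[n].iov_len;
        n++;
    }
    *iovcnt = n;
    return (0);
}
```

    Counting the free blocks before loaning any of them is what keeps a failed
    request from leaving buffers partially reserved.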

 c. Returning zero-copy buffers

    #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
    fop_retzcbuf(vp, uiozcp, cr, ct)

    int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);

    This function returns the buffers previously obtained via a call
    to VOP_REQZCBUF(). In case multiple buffers are associated with the
    uio_iov[], all the buffers associated with the uiozcp are returned.
    In other words, VOP_RETZCBUF() should only be called once per xuio_t.
    The caller should not reference any of the uio_iov[] members after
    a return.

 d. New VFS feature attributes

    A new VFS feature attribute is introduced to indicate support for
    the zero-copy interface.

  #define VFSFT_ZEROCOPY_SUPPORTED     0x100000100

   Zero-copy is an optional feature. A filesystem supporting the
   zero-copy interface (i.e. the Interface Provider) must set this
   VFS feature attribute through the VFS Feature Registration
   interface[2]. Callers of the interface (i.e. the Interface Consumer)
   must check for support through the vfs_has_feature() interface.
   The intermediate fop routines (called via the VOP_* macros) detect
   when the interfaces are called for a filesystem that does not support
   zero-copy and return ENOTSUP.
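
   The split of responsibility described here (the consumer checks the
   feature, and the fop layer rejects calls to filesystems that lack it)
   might be modelled in user space as follows. mock_vfs_t, mock_has_feature()
   and mock_fop_reqzcbuf() are hypothetical stand-ins; the real
   vfs_has_feature() and fop routines are kernel interfaces and are not
   reproduced here. The feature value is the one proposed above, treated as a
   plain 64-bit mask, which is an assumption that may not match the real
   encoding.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define VFSFT_ZEROCOPY_SUPPORTED 0x100000100ULL /* value from the proposal */

/* Hypothetical stand-in for a mounted filesystem and its operation. */
typedef struct mock_vfs {
    uint64_t features;                  /* registered feature bits */
    int (*reqzcbuf)(size_t len);        /* provider entry point, may be NULL */
} mock_vfs_t;

/* Model of a feature query: true only if every bit in f is registered. */
static int
mock_has_feature(const mock_vfs_t *vfs, uint64_t f)
{
    return ((vfs->features & f) == f);
}

/*
 * Model of the intermediate fop routine: refuse with ENOTSUP unless the
 * filesystem registered the feature and supplied an implementation.
 */
static int
mock_fop_reqzcbuf(const mock_vfs_t *vfs, size_t len)
{
    assert(vfs != NULL);
    if (!mock_has_feature(vfs, VFSFT_ZEROCOPY_SUPPORTED) ||
        vfs->reqzcbuf == NULL)
        return (ENOTSUP);
    return (vfs->reqzcbuf(len));
}

/* Trivial provider that accepts any non-empty request. */
static int
mock_provider_reqzcbuf(size_t len)
{
    return (len > 0 ? 0 : EINVAL);
}
```

   In this model a consumer that skips the feature check still gets a
   well-defined ENOTSUP rather than reaching a missing provider routine.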

                            |Proposed       |Specified   |
                            |Stability      |in what     |
  Interface Name            |Classification |Document?   | Comments
  --------------------------+---------------+------------+----------------
   VOP_REQZCBUF()           |Consolidation  |This        | New VOP calls
   fop_reqzcbuf()           |Private        |Document    |
   VOP_RETZCBUF()           |               |            |
   fop_retzcbuf()           |               |            |
                            |               |            |
   VFSFT_ZEROCOPY_SUPPORTED |               |            | New VFS feature 
                            |               |            |
   xuio_t                   |               |            | Extended uio_t 
                            |               |            |
                            |               |            |
   uioa_t                   |               |            | Deprecated
   UIO_ASYNC                |               |            | Deprecated

 * The project's deliverables will all go into the OS/NET
   Consolidation, so no contracts are required.

 == Using the New VOP Interfaces for Zero-copy ==

 VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with
 VOP_READ() or VOP_WRITE() to implement zero-copy read or write. 

 a. Read

    In a normal read, the consumer allocates the data buffer and passes it to
    VOP_READ().  The provider initiates the I/O, and copies the data from its
    own cache buffer to the consumer supplied buffer.

    To avoid the copy (initiating a zero-copy read), the consumer first calls
    VOP_REQZCBUF() to inform the provider to prepare to loan out its cache
    buffer.  It then calls VOP_READ().  After the call returns, the consumer
    has direct access to the cache buffer loaned out by the provider.  After
    processing the data, the consumer calls VOP_RETZCBUF() to return the loaned
    cache buffer to the provider.

    Here is an illustration using NFSv4 read over TCP:

        rfs4_op_read(nfs_argop4 *argop, ...)
        {
            int zerocopy;
            xuio_t *xuio;

            xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
            setup length, offset, etc;
            if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
                /* request failed: fall back to the normal copy path */
                zerocopy = 0;
                allocate the data buffer the normal way;
                initialize (uio_t *)xuio;
            } else {
                /* xuio has been set up by the provider */
                zerocopy = 1;
            }
            do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
            if (zerocopy) {
                setup callback mechanism that makes the network layer call
                VOP_RETZCBUF() and free xuio after the data is sent out;
            } else {
                kmem_free(xuio, sizeof(xuio_t));
            }
        }

 b. Write

    In a normal write, the consumer allocates the data buffer, loads the data,
    and passes the buffer to VOP_WRITE().  The provider copies the data from
    the consumer supplied buffer to its own cache buffer, and starts the I/O.

    To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to
    grab a cache buffer from the provider.  It loads the data directly to
    the loaned cache buffer, and calls VOP_WRITE().  After the call returns,
    the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
    the provider.

    Here is an illustration using NFSv4 write via RDMA:

        rfs4_op_write(nfs_argop4 *argop, ...)
        {
            int zerocopy;
            xuio_t *xuio;

            xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
            setup length, offset, etc;
            if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
                /* request failed: fall back to the normal copy path */
                zerocopy = 0;
                allocate the data buffer the normal way;
                initialize (uio_t *)xuio;
            } else {
                /* xuio has been set up by the provider */
                zerocopy = 1;
                xdrrdma_zcopy_read_from_client(..., xuio);
            }
            do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
            if (zerocopy) {
                VOP_RETZCBUF(vp, xuio, cr, &ct);
            }
            kmem_free(xuio, sizeof(xuio_t));
        }
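
    The request / I/O / return sequence and its fallback path, common to the
    read and write illustrations, can also be modelled in user space. Every
    name prefixed mock_ is a hypothetical stand-in: the "filesystem" is a
    single cache buffer, and the copy counter exists only to make the saved
    copy visible. This is a sketch of the calling pattern, not of any real
    filesystem or NFS server code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MOCK_BUFSZ 64

static char mock_cache[MOCK_BUFSZ];     /* the provider's cache buffer */
static int mock_on_loan;                /* 1 while the buffer is loaned */
static int mock_copies;                 /* data copies done by the provider */

/* Loan the cache buffer if it is free and large enough. */
static int
mock_reqzcbuf(size_t len, char **bufp)
{
    if (mock_on_loan || len > MOCK_BUFSZ)
        return (EINVAL);
    mock_on_loan = 1;
    *bufp = mock_cache;
    return (0);
}

/* Write: a loaned buffer is already the cache, so nothing is copied. */
static void
mock_write(const char *buf, size_t len)
{
    if (buf != mock_cache) {
        memcpy(mock_cache, buf, len);
        mock_copies++;
    }
}

/* Give the loaned buffer back; the caller must not touch it afterwards. */
static void
mock_retzcbuf(void)
{
    assert(mock_on_loan);
    mock_on_loan = 0;
}

/* Consumer: try the zero-copy path first, fall back to a private buffer. */
static int
mock_consumer_write(const char *data, size_t len)
{
    char local[MOCK_BUFSZ];
    char *buf = local;          /* used only if the request fails */
    int zerocopy;

    if (len > MOCK_BUFSZ)
        return (EINVAL);
    zerocopy = (mock_reqzcbuf(len, &buf) == 0);
    memcpy(buf, data, len);     /* stands in for receiving from the wire */
    mock_write(buf, len);
    if (zerocopy)
        mock_retzcbuf();        /* exactly once per successful request */
    return (0);
}
```

    When the request succeeds the provider performs no copy at all; when it
    fails the consumer follows the ordinary path and the provider copies once.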

  [1] PSARC/2003/172 File Event Monitoring 
  [2] PSARC/2007/227 VFS Features 

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open
