Rick Matthews wrote:
> Are there instances where an assigned zero-copy buffer could be orphaned?
No. The consumer must release the buffers through VOP_RETZCBUF().

Mahesh

> If so, should there be a recovery list associated with this addition?
> Perhaps off the designated vnode.
>
> This comment shouldn't block fast-track approval. Just a question.
> -- 
> Rick
>
> On 09/ 9/09 04:02 PM, Rich.Brown at Sun.COM wrote:
>> I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli 
>> Zhang.
>> This case proposes new interfaces to support copy reduction in the 
>> I/O path
>> especially for file sharing services.
>>
>> Minor binding is requested.
>>
>> This times out on Wednesday, 16 September, 2009.
>>
>>
>> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
>> This information is Copyright 2009 Sun Microsystems
>> 1. Introduction
>>     1.1. Project/Component Working Name:
>>      Copy Reduction Interfaces
>>     1.2. Name of Document Author/Supplier:
>>      Author:  Mahesh Siddheshwar, Chunli Zhang
>>     1.3  Date of This Document:
>>     09 September, 2009
>> 4. Technical Description
>>
>>  == Introduction/Background ==
>>
>>  Zero-copy (copy avoidance) is essentially buffer sharing
>>  among multiple modules that pass data between the modules.  This
>>  proposal avoids the data copy in the READ/WRITE path of filesystems
>>  by providing a mechanism to share data buffers
>>  between the modules. It is intended to be used by network file
>>  sharing services like NFS, CIFS or others.
>>
>>  Although the buffer sharing can be achieved through a few different
>>  solutions, any such solution must work with File Event Monitors
>>  (FEM monitors)[1] installed on the files. The solution must
>>  allow the underlying filesystem to maintain any existing file range
>>  locking in the filesystem.
>>  
>>  The proposed solution provides extensions to the existing VOP
>>  interface to request and return buffers from a filesystem. The 
>>  buffers are then used with existing VOP_READ/VOP_WRITE calls with
>>  minimal changes.
>>
>>
>>  == Proposed Changes ==
>>
>>  VOP Extensions for Zero-Copy Support
>>  ========================================
>>
>>  a. Extended struct uio, xuio_t
>>
>>   The following proposes an extensible uio structure that can be
>>   extended for multiple purposes.  For example, an immediate extension,
>>   xu_zc, is to be used by the proposed VOP_REQZCBUF/VOP_RETZCBUF
>>   interfaces to pass loaned zero-copy buffers, as well as to be passed
>>   to the existing VOP_READ/VOP_WRITE calls for normal read/write
>>   operations.  Another example of extension, xu_aio, is intended to
>>   replace uioa_t for async I/O.
>>
>>   This new structure, xuio_t, contains the following:
>>
>>   - the existing uio structure (embedded) as the first member
>>   - additional fields to support extensibility
>>   - a union of all the defined extensions
>>
>>   The following uio_extflag is added to indicate that a uio structure is
>>   indeed an xuio_t:
>>
>>   #define    UIO_XUIO    0x004    /* Structure is xuio_t */
>>
>>   The following uio_extflag will be removed after uioa_t has been
>>   converted to xuio_t:
>>
>>   #define    UIO_ASYNC    0x002    /* Structure is uioa_t */
>>
>>   The project team has a commitment from the networking team to remove
>>   the current use of uioa_t and use the proposed extensions (CR 6880095).
>>
>>   The definition of xuio_t is:
>>
>>   typedef struct xuio {
>>     uio_t xu_uio;        /* Embedded UIO structure */
>>
>>     /* Extended uio fields */
>>     enum xuio_type xu_type;    /* What kind of uio structure? */
>>
>>     union {
>>
>>         /* Async I/O Support */
>>         struct {
>>             uint32_t xu_a_state;    /* state of async i/o */
>>             /* bytes that have been uioamove()ed */
>>             ssize_t xu_a_mbytes;
>>             uioa_page_t *xu_a_lcur;    /* pointer into uioa_locked[] */
>>             void **xu_a_lppp;        /* pointer into lcur->uioa_ppp[] */
>>             void *xu_a_hwst[4];        /* opaque hardware state */
>>             /* Per iov locked pages */
>>             uioa_page_t xu_a_locked[UIOA_IOV_MAX];
>>         } xu_aio;
>>
>>         /* Zero Copy Support */
>>         struct {
>>             enum uio_rw xu_zc_rw;    /* the use of the buffer */
>>             void *xu_zc_priv;        /* fs specific */
>>         } xu_zc;
>>
>>     } xu_ext;
>>   } xuio_t;
>>
>>   where xu_type is currently defined as:
>>
>>   typedef enum xuio_type {
>>     UIOTYPE_ASYNCIO,
>>     UIOTYPE_ZEROCOPY
>>   } xuio_type_t;
>>
>>   New uio extensions can be added by defining a new xuio_type_t value
>>   and adding a new member to the xu_ext union.
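
A minimal user-space sketch of the layout described above, assuming
cut-down stand-in types (the real uio_t has many more fields, and the
xu_aio member is omitted). uio_to_zc_xuio() is a hypothetical helper, not
part of the proposal. It only illustrates why the embedded uio_t must be
the first member: code that receives a plain uio_t pointer can check
UIO_XUIO and then recover the enclosing xuio_t with a cast.

```c
#include <stddef.h>
#include <stdint.h>

#define UIO_XUIO 0x004              /* flag value taken from the proposal */

/* Stand-in for the kernel uio_t; only the fields used here. */
typedef struct uio {
    int      uio_iovcnt;
    uint16_t uio_extflag;
} uio_t;

typedef enum xuio_type { UIOTYPE_ASYNCIO, UIOTYPE_ZEROCOPY } xuio_type_t;
enum uio_rw { UIO_READ, UIO_WRITE };

typedef struct xuio {
    uio_t xu_uio;                   /* must remain the first member */
    enum xuio_type xu_type;
    union {
        struct {
            enum uio_rw xu_zc_rw;
            void *xu_zc_priv;
        } xu_zc;
    } xu_ext;
} xuio_t;

/*
 * Hypothetical helper: return the zero-copy xuio_t behind a uio_t pointer,
 * or NULL if the uio is not an xuio_t or is an xuio_t of another type.
 */
xuio_t *
uio_to_zc_xuio(uio_t *uio)
{
    xuio_t *xuio;

    if ((uio->uio_extflag & UIO_XUIO) == 0)
        return (NULL);
    xuio = (xuio_t *)uio;           /* valid because xu_uio is at offset 0 */
    if (xuio->xu_type != UIOTYPE_ZEROCOPY)
        return (NULL);
    return (xuio);
}
```
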
>>
>>  b. Requesting zero-copy buffers
>>
>>     #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
>>     fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
>>
>>     int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
>>     caller_context_t *);
>>  
>>     This function requests buffers associated with file vp in
>>     preparation for a subsequent zero-copy read or write. The extended
>>     uio_t, xuio_t, is used to pass the parameters and results. Only the
>>     following fields of xuio_t are relevant to this call:
>>  
>>     uiozcp->xu_uio.uio_resid: used by the caller to specify the total
>>          length of the buffer.
>>
>>     uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file
>>          offset it would like the buffers to be associated with. A value
>>          of -1 requests buffers that are not associated with a
>>          particular offset.  These are defined to be anonymous buffers.
>>          Anonymous buffers may be used when requesting a write buffer to
>>          receive data over the wire, where the file offset might not be
>>          readily available.
>>
>>     uiozcp->xu_uio.uio_iov: used by the provider to return an array of
>>          buffers (in case multiple filesystem buffers have to be
>>          reserved for the requested length).
>>
>>     uiozcp->xu_uio.uio_iovcnt: used by the provider to indicate the
>>          number of returned buffers (length of the array
>>          uiozcp->xu_uio.uio_iov).
>>
>>     Other arguments to the call include:
>>
>>     vp:  vnode pointer of the associated file.
>>
>>     rwflag: Indicates what the buffers are to be subsequently used for.
>>             Expected values are UIO_READ for VOP_READ() and UIO_WRITE
>>             for VOP_WRITE().
>>
>>     Upon successful completion, the function returns 0. One or more
>>     buffers may be returned as referenced by the uio_iov[] and uio_iovcnt
>>     members. uiozcp->xu_uio.uio_extflag is set to UIO_XUIO, and
>>     uiozcp->xu_type is set to UIOTYPE_ZEROCOPY.
>>
>>     The caller can use this returned xuio_t in a subsequent call to
>>     VOP_READ() or VOP_WRITE(). In the case of UIO_READ buffers, the
>>     caller should reference the uio_iov[] buffers only after a
>>     successful VOP_READ(). In the case of UIO_WRITE buffers, the caller
>>     should not reference the uio_iov[] buffers after a successful
>>     VOP_WRITE().
>>
>>     In the case of anonymous buffers, the caller should set the value of
>>     uio_loffset before such a read/write call. This should be done only
>>     in the case of anonymous buffers.
>>
>>     The member xu_zc_priv of the extended uio structure for zero-copy is
>>     a private handle that may be used by the provider to track its
>>     buffer headers or any other private information that is useful to
>>     map the loaned iovec entries to its internal buffers. The xu_zc_priv
>>     member is private to the provider and should not be changed or
>>     interpreted in any way by the callers.
>>
>>     Upon failure, the function returns EINVAL and the content of uiozcp
>>     should be ignored by the callers. The provider must fail the
>>     request if it is unable to satisfy the complete request (i.e., it
>>     must not return buffers that cover only a part of the length that
>>     was asked for).
>>
>>     Probable causes for failure include:
>>
>>     - the filesystem is short on buffers to loan out at the time
>>     - the filesystem determines that it's not efficient to take the
>>       zero-copy path based on the input parameters
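
The all-or-nothing rule above can be sketched in user space as follows.
This is only an illustration under stated assumptions: mock_pool_t,
mock_iovec_t, mock_reqzcbuf(), POOL_BUFS and BUF_SIZE are hypothetical
names, and a real provider would loan out its own cache buffers rather
than a static array.

```c
#include <errno.h>
#include <stddef.h>

#define POOL_BUFS 4                 /* hypothetical pool size */
#define BUF_SIZE  4096              /* hypothetical per-buffer size */

typedef struct mock_iovec {
    void  *iov_base;
    size_t iov_len;
} mock_iovec_t;

typedef struct mock_pool {
    char bufs[POOL_BUFS][BUF_SIZE];
    int  loaned[POOL_BUFS];         /* 1 while a buffer is out on loan */
} mock_pool_t;

/*
 * Hypothetical provider-side request: loan enough buffers to cover resid
 * bytes, or return EINVAL without loaning anything. On success, iov[] and
 * *iovcnt describe the loaned buffers; on failure they are left untouched.
 */
int
mock_reqzcbuf(mock_pool_t *pool, size_t resid, mock_iovec_t *iov, int *iovcnt)
{
    size_t need = (resid + BUF_SIZE - 1) / BUF_SIZE;
    size_t avail = 0;
    int i, n = 0;

    for (i = 0; i < POOL_BUFS; i++) {
        if (!pool->loaned[i])
            avail++;
    }
    if (resid == 0 || need > avail)
        return (EINVAL);            /* never hand back a partial loan */

    for (i = 0; i < POOL_BUFS && resid > 0; i++) {
        size_t len;

        if (pool->loaned[i])
            continue;
        len = resid < BUF_SIZE ? resid : BUF_SIZE;
        pool->loaned[i] = 1;
        iov[n].iov_base = pool->bufs[i];
        iov[n].iov_len = len;
        resid -= len;
        n++;
    }
    *iovcnt = n;
    return (0);
}
```

Checking availability before marking any buffer as loaned is what keeps
the failure path free of partially loaned state.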
>>
>>  c. Returning zero-copy buffers
>>
>>     #define VOP_RETZCBUF(vp, uiozcp, cr, ct) \
>>     fop_retzcbuf(vp, uiozcp, cr, ct)
>>
>>     int fop_retzcbuf(vnode_t *, xuio_t *, cred_t *, caller_context_t *);
>>  
>>     This function returns the buffers previously obtained via a call
>>     to VOP_REQZCBUF(). In case multiple buffers are associated with the
>>     uio_iov[], all the buffers associated with the uiozcp are returned.
>>     In other words, VOP_RETZCBUF() should only be called once per
>>     xuio_t.
>>     The caller should not reference any of the uio_iov[] members after
>>     a return.
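
A rough sketch of the "return everything, once" rule, which also shows
how a provider could notice a loan that is still outstanding. All names
here (mock_loan_t, mock_xuio_t, mock_loan_out(), mock_retzcbuf(), the
outstanding counter) are hypothetical, and rejecting a second return with
EINVAL is an assumption of this sketch; the proposal does not say what a
provider does in that case.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical provider-side record of one loan (one per xuio). */
typedef struct mock_loan {
    int nbufs;                      /* buffers covered by this loan */
} mock_loan_t;

/* Cut-down stand-in for xuio_t: buffer count plus the private handle. */
typedef struct mock_xuio {
    int   iovcnt;
    void *xu_zc_priv;               /* opaque to the consumer */
} mock_xuio_t;

int outstanding;                    /* buffers currently on loan */

/* Provider side of a request: record the loan behind the private handle. */
void
mock_loan_out(mock_xuio_t *x, mock_loan_t *loan, int nbufs)
{
    loan->nbufs = nbufs;
    x->iovcnt = nbufs;
    x->xu_zc_priv = loan;
    outstanding += nbufs;
}

/*
 * Provider side of a return: release every buffer of the loan in one call.
 * A second call on the same xuio finds no handle and is rejected.
 */
int
mock_retzcbuf(mock_xuio_t *x)
{
    mock_loan_t *loan = x->xu_zc_priv;

    if (loan == NULL)
        return (EINVAL);
    outstanding -= loan->nbufs;
    x->iovcnt = 0;
    x->xu_zc_priv = NULL;
    return (0);
}
```
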
>>
>>  d. New VFS feature attributes
>>
>>     A new VFS feature attribute is introduced for the support of
>>     zero-copy interface.
>>
>>   #define VFSFT_ZEROCOPY_SUPPORTED     0x100000100
>>
>>    Zero-copy is an optional feature. A filesystem supporting the
>>    zero-copy interface (i.e., the Interface Provider) must set this
>>    VFS feature attribute through the VFS Feature Registration
>>    interface[2]. Callers of the interface (i.e., the Interface Consumer)
>>    must check for the presence of support through the vfs_has_feature()
>>    interface. The intermediate fop routines (called via the VOP_*
>>    macros) will detect if the interfaces are being called for a
>>    filesystem that does not support zero-copy and will return ENOTSUP.
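
The consumer-side check might look roughly like this in a user-space
mock. The mock_* names, MOCK_FT_ZEROCOPY and choose_path() are
hypothetical stand-ins for vfs_has_feature(), VFSFT_ZEROCOPY_SUPPORTED
and the caller's own logic; only the ENOTSUP behaviour of the fop layer
is taken from the text above.

```c
#include <errno.h>
#include <stdint.h>

#define MOCK_FT_ZEROCOPY 0x1ULL     /* stand-in for VFSFT_ZEROCOPY_SUPPORTED */

typedef struct mock_vfs {
    uint64_t features;              /* feature bits registered by the fs */
} mock_vfs_t;

typedef enum io_path { PATH_COPY, PATH_ZEROCOPY } io_path_t;

/* Stand-in for vfs_has_feature(): non-zero if the feature bit is set. */
int
mock_has_feature(const mock_vfs_t *vfs, uint64_t feature)
{
    return ((vfs->features & feature) != 0);
}

/* Stand-in for the fop layer: refuse the call on unsupporting filesystems. */
int
mock_fop_reqzcbuf(const mock_vfs_t *vfs)
{
    if (!mock_has_feature(vfs, MOCK_FT_ZEROCOPY))
        return (ENOTSUP);
    return (0);                     /* a real provider may still fail here */
}

/* Consumer: check the feature first, and fall back to copying on any error. */
io_path_t
choose_path(const mock_vfs_t *vfs)
{
    if (!mock_has_feature(vfs, MOCK_FT_ZEROCOPY))
        return (PATH_COPY);
    return (mock_fop_reqzcbuf(vfs) == 0 ? PATH_ZEROCOPY : PATH_COPY);
}
```
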
>>
>>  INTERFACE TABLE
>>  +=========================================================================+
>>                             |Proposed       |Specified   |
>>                             |Stability      |in what     |
>>   Interface Name            |Classification |Document?   | Comments
>>  +=========================================================================+
>>    VOP_REQZCBUF()           |Consolidation  |This        | New VOP calls
>>    fop_reqzcbuf()           |Private        |Document    |
>>    VOP_RETZCBUF()           |               |            |
>>    fop_retzcbuf()           |               |            |
>>                             |               |            |
>>    VFSFT_ZEROCOPY_SUPPORTED |               |            | New VFS feature definition
>>                             |               |            |
>>    xuio_t                   |               |            | Extended uio_t definition
>>                             |               |            |
>>    uioa_t                   |               |            | Deprecated
>>    UIO_ASYNC                |               |            | Deprecated
>>  +=========================================================================+
>>
>>
>>  * The project's deliverables will all go into the OS/NET
>>    Consolidation, so no contracts are required.
>>
>>
>>  == Using the New VOP Interfaces for Zero-copy ==
>>
>>  VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction
>>  with VOP_READ() or VOP_WRITE() to implement zero-copy read or write.
>>
>>  a. Read
>>
>>     In a normal read, the consumer allocates the data buffer and passes
>>     it to VOP_READ().  The provider initiates the I/O, and copies the
>>     data from its own cache buffer to the consumer-supplied buffer.
>>
>>     To avoid the copy (initiating a zero-copy read), the consumer first
>>     calls VOP_REQZCBUF() to inform the provider to prepare to loan out
>>     its cache buffer.  It then calls VOP_READ().  After the call returns,
>>     the consumer has direct access to the cache buffer loaned out by the
>>     provider.  After processing the data, the consumer calls
>>     VOP_RETZCBUF() to return the loaned cache buffer to the provider.
>>
>>     Here is an illustration using NFSv4 read over TCP:
>>
>>         rfs4_op_read(nfs_argop4 *argop, ...)
>>         {
>>             int zerocopy;
>>             xuio_t *xuio;
>>             ...
>>             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>>             setup length, offset, etc;
>>             if (VOP_REQZCBUF(vp, UIO_READ, xuio, cr, &ct)) {
>>                 zerocopy = 0;
>>                 allocate the data buffer the normal way;
>>                 initialize (uio_t *)xuio;
>>             } else {
>>                 /* xuio has been setup by the provider */
>>                 zerocopy = 1;
>>             }
>>             do_io(FREAD, vp, (uio_t *)xuio, 0, cr, &ct);
>>             ...
>>             if (zerocopy) {
>>                 setup callback mechanism that makes the network layer
>>                 call VOP_RETZCBUF() and free xuio after the data is
>>                 sent out;
>>             } else {
>>                 kmem_free(xuio, sizeof(xuio_t));
>>             }
>>         }
>>
>>  b. Write
>>
>>     In a normal write, the consumer allocates the data buffer, loads the
>>     data, and passes the buffer to VOP_WRITE().  The provider copies the
>>     data from the consumer-supplied buffer to its own cache buffer, and
>>     starts the I/O.
>>
>>     To initiate a zero-copy write, the consumer first calls
>>     VOP_REQZCBUF() to grab a cache buffer from the provider.  It loads
>>     the data directly into the loaned cache buffer, and calls
>>     VOP_WRITE().  After the call returns, the consumer calls
>>     VOP_RETZCBUF() to return the loaned cache buffer to the provider.
>>
>>     Here is an illustration using NFSv4 write via RDMA:
>>
>>         rfs4_op_write(nfs_argop4 *argop, ...)
>>         {
>>             int zerocopy;
>>             xuio_t *xuio;
>>             ...
>>             xuio = kmem_alloc(sizeof(xuio_t), KM_SLEEP);
>>             setup length, offset, etc;
>>             if (VOP_REQZCBUF(vp, UIO_WRITE, xuio, cr, &ct)) {
>>                 zerocopy = 0;
>>                 allocate the data buffer the normal way;
>>                 initialize (uio_t *)xuio;
>>                 xdrrdma_read_from_client(...);
>>             } else {
>>                 /* xuio has been setup by the provider */
>>                 zerocopy = 1;
>>                 xdrrdma_zcopy_read_from_client(..., xuio);
>>             }
>>             do_io(FWRITE, vp, (uio_t *)xuio, 0, cr, &ct);
>>             ...
>>             if (zerocopy) {
>>                 VOP_RETZCBUF(vp, xuio, cr, &ct);
>>             }
>>             kmem_free(xuio, sizeof(xuio_t));
>>         }
>>
>>
>>  References:
>>   [1] PSARC/2003/172 File Event Monitoring
>>   [2] PSARC/2007/227 VFS Features
>>
>> 6. Resources and Schedule
>>     6.4. Steering Committee requested information
>>        6.4.1. Consolidation C-team Name:
>>         ON
>>     6.5. ARC review type: FastTrack
>>     6.6. ARC Exposure: open
>>
>>   
>
>
