Following our discussion about O_DIRECT at the last OpenZFS Leadership
meeting
<https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit>
 (video <https://www.youtube.com/watch?v=FVwYAwrKZCU&feature=youtu.be>),
Mark Maybee, Brian Behlendorf, Brian Atkinson, and I worked through the exact
semantics that O_DIRECT should have for ZFS.  Our proposal is below, and
there are additional details in our design document
<https://docs.google.com/document/d/1C8AgqoRodxutvYIodH39J8hRQ2xcP9poELU6Z6CXmzY/edit?usp=sharing>,
including other options considered, and more reasoning behind the choices
made.  Please let us know if you have any questions about this.  We will
have an opportunity to discuss it at tomorrow's meeting as well (9AM
pacific; zoom <https://delphix.zoom.us/j/454165831>).


*Summary of proposed OpenZFS O_DIRECT semantics:*

Broadly speaking, we interpret O_DIRECT as an indication that the user does
not expect to benefit from caching of their data, and that we should try to
improve performance by taking advantage of that expectation.  It is also a
request to optimize write throughput, even if this causes a large increase in
the latency of individual write requests.

We see O_DIRECT as a tool for sophisticated applications to get greatly
improved performance for certain workloads, especially very high throughput
workloads (gigabytes per second).  For best performance, knowledge of how
O_DIRECT behaves (on ZFS specifically) may be required. However, even naive
use of O_DIRECT will not violate ZFS’s core principles of data integrity
and ease of use, and should result in improved performance in most
circumstances.

Based on the above principles, we plan to implement the following semantics:


- Coherence with buffered I/O
  - When a file is accessed with both O_DIRECT and buffered (non-O_DIRECT)
    I/O, all readers see the same file contents.
    - I.e. O_DIRECT and buffered accesses are coherent.
- Reads
  - If the data is already cached in the ARC, or if it’s dirty in the DMU,
    it will be copied from the ARC/DMU.
    - However, this does not count as an access for ARC retention purposes
      (i.e. the data will fall out of the cache as though this access did
      not happen).
  - If the access is not page-aligned (4K-aligned), the request will fail
    with an error.
  - The access need not be block-aligned for the I/O to be performed
    directly (bypassing the cache, reading directly into the user buffer).
    (“Block”-aligned meaning dn_datablksz, which is controlled by the
    recordsize property.) The non-requested part of the block will be
    discarded. (The caching behavior above still applies: if cached, we
    will read from the cache.)
- Writes
  - If the data is cached in the ARC, or if it’s dirty in the DMU, it will
    be discarded from the ARC/DMU, and the write performed directly.
  - If the access is not page-aligned (4K-aligned), the request will fail
    with an error.
  - If the access is not block-aligned, the write will be performed
    buffered (as though O_DIRECT had not been specified). However, if the
    block was not already cached, it will be discarded from the cache
    after the TXG completes (i.e. after it is written to disk by
    spa_sync()). This ensures that sequential sub-block O_DIRECT writes do
    not have pathologically bad performance.
  - The checksum is guaranteed to always be of the data that is written to
    disk.
    - If the access is from another kernel subsystem (e.g. Lustre, NFS,
      iSCSI), we can ensure that the buffer provided is not concurrently
      modified while ZFS is accessing it. Therefore we can send the user’s
      buffer directly to the checksumming, compression, encryption, and
      RAID parity routines and to the disk driver, without making a copy
      into a temporary buffer.
    - However, if the access is via a write() system call, then we assume
      that another user thread could be concurrently modifying the buffer
      (via memory stores). In this case, if:
      - the checksum is not “off”,
      - OR compression is not “off”,
      - OR encryption is not “off”,
      - OR RAIDZ/DRAID is used,
      - OR mirroring is used,
      THEN we will make a temporary copy of the buffer to ensure that it
      is not modified between when the data is read by
      checksumming/compression/RAID and when it is written to disk.
    - For write() system calls, additional performance may be achieved by
      setting checksum=off and not using compression, encryption, RAIDZ,
      or mirroring.
- O_SYNC and O_DIRECT are orthogonal (i.e. O_DIRECT does not imply that
  the data is persistent on disk; O_SYNC must also be specified if this is
  desired).
- Properties
  - There will be one new property, named “direct” or similar, with the
    following values:
    - “Disabled”: the old behavior of ignoring O_DIRECT
    - “Default”: the new behavior described above (this is the default
      setting)
    - “Always”: acts as though O_DIRECT were always specified
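From a consumer's point of view, the alignment and persistence rules above
look like the following sketch. This is illustration only, not ZFS code: it
uses the generic Linux O_DIRECT flag, gets a page-aligned buffer from an
anonymous mmap, and falls back to buffered I/O if the platform or filesystem
rejects O_DIRECT (e.g. tmpfs), so it may not exercise the direct path at all.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # typically 4096; O_DIRECT requests must be page-aligned

# An anonymous mmap returns a page-aligned buffer; under the proposed
# semantics a misaligned request would fail with an error.
buf = mmap.mmap(-1, PAGE)
buf.write(b"A" * PAGE)

fd_tmp, path = tempfile.mkstemp()
os.close(fd_tmp)

try:
    # O_DIRECT alone does not imply persistence; O_SYNC would also have to
    # be specified if the data must be on stable storage when write() returns.
    fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
except (OSError, AttributeError):
    # The platform or underlying filesystem may not support O_DIRECT;
    # fall back to buffered I/O so the sketch still runs.
    fd = os.open(path, os.O_WRONLY)

# Page-aligned offset (0) and page-sized length satisfy the alignment rule.
written = os.write(fd, memoryview(buf)[:PAGE])
os.close(fd)
print(f"wrote {written} bytes")
os.unlink(path)
```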
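The write()-path copy decision above amounts to a simple predicate: copy the
user buffer whenever any component may re-read it after the initial pass. A
sketch of that rule (the function name and argument shapes are mine, not ZFS
code):

```python
def needs_copy_for_write_syscall(checksum, compression, encryption,
                                 raidz_or_draid, mirrored):
    """Proposed rule: for write() system calls, ZFS makes a temporary copy
    of the user buffer unless checksum, compression, and encryption are all
    "off" and neither RAIDZ/DRAID nor mirroring is in use, since another
    user thread could modify the buffer between checksumming and the disk
    write."""
    return (checksum != "off"
            or compression != "off"
            or encryption != "off"
            or raidz_or_draid
            or mirrored)

# Only this configuration lets ZFS send the user's buffer straight to disk:
print(needs_copy_for_write_syscall("off", "off", "off", False, False))  # False
# Any checksum other than "off" forces the temporary copy:
print(needs_copy_for_write_syscall("fletcher4", "off", "off", False, False))  # True
```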

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-M04f8db4dd4502982be9bec58
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
