Following our discussion about O_DIRECT at the last OpenZFS Leadership meeting <https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit> (video <https://www.youtube.com/watch?v=FVwYAwrKZCU&feature=youtu.be>), Mark Maybee, Brian Behlendorf, Brian Atkinson, and I worked through the exact semantics that O_DIRECT should have for ZFS. Our proposal is below, and there are additional details in our design document <https://docs.google.com/document/d/1C8AgqoRodxutvYIodH39J8hRQ2xcP9poELU6Z6CXmzY/edit?usp=sharing>, including other options considered and more of the reasoning behind the choices made. Please let us know if you have any questions. We will also have an opportunity to discuss it at tomorrow's meeting (9 AM Pacific; zoom <https://delphix.zoom.us/j/454165831>).
*Summary of proposed OpenZFS O_DIRECT semantics:*

Broadly speaking, we interpret O_DIRECT as an indication that the user does not expect to benefit from caching of their data, and that we should try to improve performance by taking advantage of that expectation. It is also a request to optimize write throughput, even if this causes a large increase in the latency of individual write requests. We see O_DIRECT as a tool for sophisticated applications to get greatly improved performance for certain workloads, especially very high-throughput workloads (gigabytes per second). For best performance, knowledge of how O_DIRECT behaves (on ZFS specifically) may be required. However, even naive use of O_DIRECT will not violate ZFS's core principles of data integrity and ease of use, and should result in improved performance in most circumstances.

Based on the above principles, we plan to implement the following semantics:

- Coherence with buffered I/O
  - When a file is accessed with both O_DIRECT and buffered (non-O_DIRECT) I/O, all readers see the same file contents; i.e., O_DIRECT and buffered accesses are coherent.
- Reads
  - If the data is already cached in the ARC, or is dirty in the DMU, it will be copied from the ARC/DMU.
    - However, this does not count as an access for ARC retention purposes (i.e., the data will fall out of the cache as though this access had not happened).
  - If the access is not page-aligned (4K-aligned), the request will fail with an error.
  - The access need not be block-aligned for the I/O to be performed directly (bypassing the cache, reading directly into the user buffer). ("Block"-aligned here means aligned to dn_datablksz, which is controlled by the recordsize property.) The non-requested part of the block will be discarded. (The caching behavior above still applies: if cached, we will read from the cache.)
- Writes
  - If the data is cached in the ARC, or is dirty in the DMU, it will be discarded from the ARC/DMU, and the write performed directly.
  - If the access is not page-aligned (4K-aligned), the request will fail with an error.
  - If the access is not block-aligned, the write will be performed buffered (as though O_DIRECT had not been specified). However, if the block was not already cached, it will be discarded from the cache after the TXG completes (i.e., after it is written to disk by spa_sync()). This ensures that sequential sub-block O_DIRECT writes do not have pathologically bad performance.
  - The checksum is guaranteed to always be of the data that is written to disk.
    - If the access is from another kernel subsystem (e.g. Lustre, NFS, iSCSI), we can ensure that the provided buffer is not concurrently modified while ZFS is accessing it. Therefore we can send the user's buffer directly to the checksumming, compression, encryption, and RAID parity routines and to the disk driver, without making a copy into a temporary buffer.
    - However, if the access is via a write() system call, then we assume that another user thread could be concurrently modifying the buffer (via memory stores). In this case, if:
      - the checksum is not "off",
      - OR compression is not "off",
      - OR encryption is not "off",
      - OR RAIDZ/dRAID is used,
      - OR mirroring is used,
      - THEN we will make a temporary copy of the buffer to ensure that it is not modified between when the data is read by the checksumming/compression/RAID routines and when it is written to disk.
    - For write() system calls, additional performance may be achieved by setting checksum=off and not using compression, encryption, RAIDZ, or mirroring.
  - O_SYNC and O_DIRECT are orthogonal (i.e. O_DIRECT does not imply that the data is persistent on disk; O_SYNC must also be specified if that is desired).
- Properties
  - There will be one new property, named "direct" or similar, with the following values:
    - "disabled": the old behavior of ignoring O_DIRECT
    - "default": the new behavior described above (this is the default setting)
    - "always": act as though O_DIRECT were always specified

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-M04f8db4dd4502982be9bec58
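If the proposal lands as described, administrators would control the behavior with the usual `zfs set`/`zfs get` commands. A sketch, assuming the property ends up named "direct" as proposed (the name and values could still change, and "tank/fs" is an illustrative dataset):

```shell
# Proposed property values, per the proposal above (names may change):
zfs set direct=disabled tank/fs   # old behavior: ignore O_DIRECT
zfs set direct=default  tank/fs   # the semantics described above (default)
zfs set direct=always   tank/fs   # act as though O_DIRECT were always specified

zfs get direct tank/fs            # inspect the current setting
```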