On Mon, Mar 30, 2020 at 7:08 PM Richard Laager <rlaa...@wiktel.com> wrote:
> My only personal interest in O_DIRECT is for KVM qemu virtualization. It
> sounds like I will probably need to set direct=disabled. Alternatively,
> if I could get all the writes to be 4K-aligned (e.g. by making all the
> virtual disks 4Kn?), then ZFS's O_DIRECT would work.

We were thinking that qemu *would* be able to use O_DIRECT, or at least
that it wouldn't need direct=disabled. But I think your assessment implies
that qemu usually issues O_DIRECT i/o that is not page (4K) aligned, in
which case it would get an error. AFAIK, all other filesystems that
implement O_DIRECT also fail on non-page-aligned i/o, so it's surprising
that qemu would expect something other than that. Maybe I'm missing
something here? I'm not that familiar with KVM/qemu deployments; maybe
folks do usually use 4Kn virtual disks?

> The rest are some questions for here or the call tomorrow, if you think
> they're worthwhile:

Thanks for your questions. Responses below:

> On 3/30/20 5:29 PM, Matthew Ahrens via openzfs-developer wrote:
> > It is also a request to optimize write throughput, even if
> > this causes a large increase in latency of individual write requests.
>
> This was surprising to me. Can you comment on this more? Is this true
> even in scenarios like databases? (I honestly don't know. This is above
> my level of expertise.)

Typical O_DIRECT semantics on other filesystems is that a write call does
not return until the i/o to disk completes. We will be doing the same with
ZFS (for block-aligned I/O). This gives the filesystem the flexibility to
handle the write with less memory and fewer bcopy()'s, since we can use
the user-provided buffer rather than copying it into our own buffer (and
keeping track of it, etc.). Compared to the typical behavior of just
copying the data to memory (assuming they are not using O_SYNC), the
latency of O_DIRECT is often MUCH worse (milliseconds vs. microseconds).
So O_DIRECT only makes sense if the application cares much more about
throughput than latency. It can achieve high throughput with many
concurrent O_DIRECT writes, and/or very large O_DIRECT writes.

> > For write() system calls, additional performance may be
> > achieved by setting checksum=off and not using compression,
> > encryption, RAIDZ, or mirroring.
>
> Is there a likely use case for this scenario? Databases always come up
> in O_DIRECT discussions, but having to have no redundancy to get the
> most performance is a serious limitation. (Note: I have no idea how
> expensive the one copy is.)

I'm not sure. I could imagine someone comparing ZFS to an alternative
filesystem where they are using O_DIRECT, and the alternative FS has no
checksumming, redundancy, etc. And they want ZFS for other reasons (e.g.
snapshots, or combining this workload with others that DO need
checksumming, compression, etc.). This mode would let them get as close as
possible to the performance of an alternative, very lightweight
filesystem. I know the Lustre folks have measured an impact of this
additional bcopy(), and they are glad that it is not needed for Lustre
(even with checksum=on, because we know Lustre won't modify the buffer
while the write is in progress).

> > "Always": acts as though O_DIRECT was always specified
>
> What is the use case for this?

If the application is naive (doesn't know about / use O_DIRECT), but the
system administrator knows that the application would benefit from
O_DIRECT. For example, some versions of "dd" don't have the
iflag=direct / oflag=direct options.

--matt

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-Mf56547ab66c4122e394413a8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription