On Mon, Mar 30, 2020 at 7:08 PM Richard Laager <rlaa...@wiktel.com> wrote:
> My only personal interest in O_DIRECT is for KVM qemu virtualization. It
> sounds like I will probably need to set direct=disabled. Alternatively,
> if I could get all the writes to be 4K-aligned (e.g. by making all the
> virtual disks 4Kn?), then ZFS's O_DIRECT would work.

We were thinking that qemu *would* be able to use O_DIRECT, or at least
that it wouldn't need direct=disabled. But I think your assessment implies
that qemu usually issues O_DIRECT i/o that is not page (4K) aligned, in
which case it would get an error. AFAIK, all other filesystems that
implement O_DIRECT also fail on non-page-aligned i/o, so it's surprising
that qemu would expect something other than that. Maybe I'm missing
something here? I'm not that familiar with KVM/qemu deployments; maybe
folks do usually use 4Kn virtual disks?

> The rest are some questions for here or the call tomorrow, if you think
> they're worthwhile:

Thanks for your questions. Responses below:

> On 3/30/20 5:29 PM, Matthew Ahrens via openzfs-developer wrote:
> > It is also a request to optimize write throughput, even if
> > this causes a large increase in latency of individual write requests.
>
> This was surprising to me. Can you comment on this more? Is this true
> even in scenarios like databases? (I honestly don't know. This is above
> my level of expertise.)

Typical O_DIRECT semantics on other filesystems is that a write call does
not return until the i/o to disk completes. We will be doing the same with
ZFS (for block-aligned I/O). This gives the filesystem the flexibility to
handle the write with less memory and fewer bcopy()'s, since we can use
the user-provided buffer rather than copying it into our own buffer (and
keeping track of it, etc.). Compared to the typical behavior of just
copying the data to memory (assuming they are not using O_SYNC), the
latency of O_DIRECT is often MUCH worse (milliseconds vs. microseconds).
So O_DIRECT only makes sense if the application cares much more about
throughput than latency. It can achieve high throughput with many
concurrent O_DIRECT writes, and/or very large O_DIRECT writes.

> > For write() system calls, additional performance may be
> > achieved by setting checksum=off and not using compression,
> > encryption, RAIDZ, or mirroring.
>
> Is there a likely use case for this scenario? Databases always come up
> in O_DIRECT discussions, but having to have no redundancy to get the
> most performance is a serious limitation. (Note: I have no idea how
> expensive the one copy is.)

I'm not sure. I could imagine someone comparing ZFS to an alternative
filesystem where they are using O_DIRECT, and the alternative FS has no
checksumming, redundancy, etc. And they want ZFS for other reasons (e.g.
snapshots, or combining this workload with others that DO need
checksumming, compression, etc.). This mode would let them get as close as
possible to the performance of an alternative, very lightweight
filesystem. I know the Lustre folks have measured an impact of this
additional bcopy(), and they are glad that it is not needed for Lustre
(even with checksum=on, because we know Lustre won't modify the buffer
while the write is in progress).

> > "Always": acts as though O_DIRECT was always specified
>
> What is the use case for this?

If the application is naive (doesn't know about / use O_DIRECT), but the
system administrator knows that the application would benefit from
O_DIRECT. For example, some versions of "dd" don't have the
iflag=direct / oflag=direct options.

--matt

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-Mf56547ab66c4122e394413a8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription