On Mon, Mar 30, 2020 at 9:08 PM Richard Laager <rlaa...@wiktel.com> wrote:
> My only personal interest in O_DIRECT is for KVM qemu virtualization. It
> sounds like I will probably need to set direct=disabled. Alternatively,
> if I could get all the writes to be 4K-aligned (e.g. by making all the
> virtual disks 4Kn?), then ZFS's O_DIRECT would work.

I've been thinking about this same thing for bhyve. My read of the
proposal is that O_DIRECT for VM disks would have a similar effect as
primarycache=metadata. It sounds reasonable so long as volblocksize=4k
(or recordsize=4k) and the guest knows to do 4k I/Os. See
VIRTIO_BLK_F_TOPOLOGY in the virtio spec.

If volblocksize is at the default 8k (or recordsize=128k for files), 4k
I/Os in the guest would seem to lead to a lot of I/O inflation in the
host unless subsequent adjacent and aligned 4k operations are quickly
coalesced. When I've looked at primarycache=metadata, I remember
(without access to the data I generated) that the read-modify-write
cycle was particularly brutal due to the quick disposal of blocks from
the ARC.

> The rest are some questions for here or the call tomorrow, if you think
> they're worthwhile:
>
> On 3/30/20 5:29 PM, Matthew Ahrens via openzfs-developer wrote:
> > It is also a request to optimize write throughput, even if
> > this causes a large increase in latency of individual write requests.
>
> This was surprising to me. Can you comment on this more? Is this true
> even in scenarios like databases? (I honestly don't know. This is above
> my level of expertise.)
>
> > For write() system calls, additional performance may be
> > achieved by setting checksum=off and not using compression,
> > encryption, RAIDZ, or mirroring.
>
> Is there a likely use case for this scenario? Databases always come up
> in O_DIRECT discussions, but having to have no redundancy to get the
> most performance is a serious limitation. (Note: I have no idea how
> expensive the one copy is.)
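Coming back to the inflation point above: here's a rough back-of-the-envelope
model of it (mine, not from the proposal), assuming no coalescing of adjacent
writes and a cold cache, so every partial-block write forces a read-modify-write
of the whole volblock. The numbers are illustrative, not measured.

```python
def host_bytes_per_guest_write(guest_io=4096, volblocksize=8192):
    """Approximate host I/O bytes for one aligned guest write.

    Assumes the guest write is block-aligned, nothing is cached, and
    no adjacent 4k writes get coalesced into a full-block write.
    """
    if guest_io >= volblocksize:
        # Full-block (or multi-block) write: no read needed, just write.
        return guest_io
    # Partial-block write: read the whole volblock, modify, write it back.
    return 2 * volblocksize
```

With volblocksize=4k a 4k guest write costs 4k of host I/O (no inflation),
while at the default 8k the same write can cost 16k (read 8k + write 8k),
i.e. 4x inflation, which matches the behavior I remember seeing with
primarycache=metadata.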
Before ZFS was a thing, I worked in environments where databases were
running using direct I/O with Veritas Database Edition. It turns out
that the Oracle DB also does a block-level checksum. Recovery wasn't
automatic, but detection of corrupt blocks was automatic, and I believe
the DBAs were able to do block-level recovery. This was always done on
top of a storage array that provided LUNs that were RAIDed on the
backend.

For the VM case, this could be interesting if the guest OS is using ZFS
and it handles redundancy and self-healing. This seems like a nightmare
to manage, particularly if the host and guest sysadmins are not the same
person/team.

Mike

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-M5b48ef8085982fcb359c20df
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription