On Mon, Mar 30, 2020 at 9:08 PM Richard Laager <rlaa...@wiktel.com> wrote:
> My only personal interest in O_DIRECT is for KVM qemu virtualization. It
> sounds like I will probably need to set direct=disabled. Alternatively,
> if I could get all the writes to be 4K-aligned (e.g. by making all the
> virtual disks 4Kn?), then ZFS's O_DIRECT would work.

I've been thinking about this same thing for bhyve. My read of the
proposal is that O_DIRECT for VM disks would have a similar effect as
primarycache=metadata. It sounds reasonable so long as volblocksize=4k
(or recordsize=4k) and the guest knows to do 4k I/Os. See
VIRTIO_BLK_F_TOPOLOGY in the virtio spec.

If volblocksize is at the default 8k (or recordsize=128k for files), 4k
I/Os in the guest would seem to lead to a lot of I/O inflation in the
host unless subsequent adjacent and aligned 4k operations are quickly
coalesced. When I've looked at primarycache=metadata, I remember
(without access to the data I generated) that the read-modify-write
cycle was particularly brutal due to the quick disposal of blocks from
the ARC.

> The rest are some questions for here or the call tomorrow, if you think
> they're worthwhile:
>
> On 3/30/20 5:29 PM, Matthew Ahrens via openzfs-developer wrote:
> > It is also a request to optimize write throughput, even if
> > this causes a large increase in latency of individual write requests.
>
> This was surprising to me. Can you comment on this more? Is this true
> even in scenarios like databases? (I honestly don't know. This is above
> my level of expertise.)
>
> > For write() system calls, additional performance may be
> > achieved by setting checksum=off and not using compression,
> > encryption, RAIDZ, or mirroring.
>
> Is there a likely use case for this scenario? Databases always come up
> in O_DIRECT discussions, but having to have no redundancy to get the
> most performance is a serious limitation. (Note: I have no idea how
> expensive the one copy is.)
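Coming back to the inflation point above: here's a rough back-of-the-envelope
model of it (mine, not from the proposal), assuming no coalescing of adjacent
writes and a cold cache, so every partial-block write forces a read-modify-write
of the whole volblock. The numbers are illustrative, not measured.

```python
def host_bytes_per_guest_write(guest_io=4096, volblocksize=8192):
    """Approximate host I/O bytes for one aligned guest write.

    Assumes the guest write is block-aligned, nothing is cached, and
    no adjacent 4k writes get coalesced into a full-block write.
    """
    if guest_io >= volblocksize:
        # Full-block (or multi-block) write: no read needed, just write.
        return guest_io
    # Partial-block write: read the whole volblock, modify, write it back.
    return 2 * volblocksize
```

With volblocksize=4k a 4k guest write costs 4k of host I/O (no inflation),
while at the default 8k the same write can cost 16k (read 8k + write 8k),
i.e. 4x inflation, which matches the behavior I remember seeing with
primarycache=metadata.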
Before ZFS was a thing, I worked in environments where databases were
running using direct I/O with Veritas Database Edition. It turns out
that the Oracle DB also does a block-level checksum. Recovery wasn't
automatic, but detection of corrupt blocks was automatic, and I believe
the DBAs were able to do block-level recovery. This was always done on
top of a storage array that provided LUNs that were RAIDed on the
backend.

For the VM case, this could be interesting if the guest OS is using ZFS
and it handles redundancy and self-healing. This seems like a nightmare
to manage, particularly if the host and guest sysadmins are not the same
person/team.

Mike

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/T950b02acdf392290-M5b48ef8085982fcb359c20df
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription