Hey Alberto,

My article on this design has just been published. In it I describe the chosen technologies and the ReadWriteMany implementation:

https://blog.deckhouse.io/lvm-qcow-csi-driver-shared-san-kubernetes-81455201590e

To answer your question: I have already tested the method of exporting the volume from a single QSD instance to the other nodes over the user network using NBD, and I ran into significant performance issues. I would also note that this method is overkill when the data is already accessible on every node through a backing block device on the SAN (a rough sketch of that NBD variant is at the very end of this mail, for reference).

Instead, we opted for switching cache.direct during live migration of a virtual machine, accepting that this is not full-fledged ReadWriteMany and will be used solely for live migration of virtual machines.
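The switch itself is just a blockdev-reopen call on the QSD monitor socket. Below is a minimal Python sketch of the mechanics only; the qsd command line in the comment, the node names, the LV path, and the socket path are illustrative placeholders, not the exact configuration our driver generates.

# Assumed (illustrative) qemu-storage-daemon startup on each node:
#
#   qemu-storage-daemon \
#     --blockdev driver=host_device,node-name=lun0,filename=/dev/vg0/pvc-123,cache.direct=on \
#     --blockdev driver=qcow2,node-name=disk0,file=lun0 \
#     --export type=vduse-blk,id=exp0,node-name=disk0,name=pvc-123,num-queues=4,writable=on \
#     --chardev socket,id=qmp0,path=/run/qsd-pvc-123.qmp,server=on,wait=off \
#     --monitor chardev=qmp0
#
#   vdpa dev add name pvc-123 mgmtdev vduse   # virtio-vdpa then exposes /dev/vda

import json
import socket

QMP_SOCK = "/run/qsd-pvc-123.qmp"   # illustrative path

def qmp(sock, reader, cmd, **arguments):
    """Send one QMP command and return its reply, skipping async events."""
    msg = {"execute": cmd}
    if arguments:
        msg["arguments"] = arguments
    sock.sendall(json.dumps(msg).encode())
    while True:
        reply = json.loads(reader.readline())
        if "return" in reply or "error" in reply:
            return reply

def set_cache_direct(direct):
    sock = socket.socket(socket.AF_UNIX)
    sock.connect(QMP_SOCK)
    reader = sock.makefile("r")
    reader.readline()                       # QMP greeting
    qmp(sock, reader, "qmp_capabilities")
    # blockdev-reopen does not merge options with the current ones, so the
    # protocol node is re-described in full, with only cache.direct changed.
    reply = qmp(sock, reader, "blockdev-reopen", options=[{
        "driver": "host_device",
        "node-name": "lun0",
        "filename": "/dev/vg0/pvc-123",
        "cache": {"direct": direct, "no-flush": False},
    }])
    sock.close()
    return reply

# e.g. while the volume is attached on both nodes during the migration window:
#   set_cache_direct(True)

When exactly to flip it during the migration flow is covered in the article; the sketch above only shows the QMP side of it.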
Best Regards,
Andrei Kvapil

On Wed, Aug 16, 2023 at 11:31 AM Alberto Faria <afa...@redhat.com> wrote:
> On Mon, Jun 19, 2023 at 6:29 PM kvaps <kva...@gmail.com> wrote:
> > Hi Kevin and the community,
> >
> > I am designing a CSI driver for Kubernetes that allows efficient
> > utilization of SAN (Storage Area Network) and supports thin
> > provisioning, snapshots, and ReadWriteMany mode for block devices.
> >
> > To implement this, I have explored several technologies such as
> > traditional LVM, LVMThin (which does not support shared mode), and
> > QCOW2 on top of block devices. This is the same approach that oVirt
> > uses for thin provisioning over a shared LUN:
> >
> > https://github.com/oVirt/vdsm/blob/08a656c/doc/thin-provisioning.md
> >
> > Based on benchmark results, I found that the performance degradation
> > of block-backed QCOW2 while creating snapshots is much lower compared
> > to LVM and LVMThin:
> >
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=2020746352
> >
> > Therefore, I have decided to use the same approach for Kubernetes.
> >
> > But in Kubernetes, the storage system needs to be self-sufficient and
> > not dependent on the workload that uses it. Thus, unlike oVirt, we
> > have no option to use the libvirt interface of the running VM to
> > invoke live migration. Instead, we should provide a pure block device
> > in ReadWriteMany mode, where the block device can be writable on
> > multiple hosts simultaneously.
> >
> > To achieve this, I decided to use qemu-storage-daemon with the VDUSE
> > backend.
> >
> > Other technologies, such as NBD and UBLK, were also considered, and
> > their benchmark results can be seen on a different sheet of the same
> > document:
> >
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=416958126
> >
> > Taking into account performance, stability, and versatility, I
> > concluded that VDUSE is the optimal choice.
> >
> > To connect the device in Kubernetes, the virtio-vdpa interface would
> > be used, and the entire scheme could look like this:
> >
> > +---------------------+      +---------------------+
> > |        node1        |      |        node2        |
> > |                     |      |                     |
> > |    +-----------+    |      |    +-----------+    |
> > |    | /dev/vda  |    |      |    | /dev/vda  |    |
> > |    +-----+-----+    |      |    +-----+-----+    |
> > |          |          |      |          |          |
> > |     virtio-vdpa     |      |     virtio-vdpa     |
> > |          |          |      |          |          |
> > |        vduse        |      |        vduse        |
> > |          |          |      |          |          |
> > | qemu-storage-daemon |      | qemu-storage-daemon |
> > |          |          |      |          |          |
> > | +------- | -------+ |      | +------- | -------+ |
> > | | LUN    |        | |      | | LUN    |        | |
> > | |  +-----+-----+  | |      | |  +-----+-----+  | |
> > | |  | LV (qcow2)|  | |      | |  | LV (qcow2)|  | |
> > | |  +-----------+  | |      | |  +-----------+  | |
> > | +--------+--------+ |      | +--------+--------+ |
> > |          |          |      |          |          |
> > |          |          |      |          |          |
> > +--------- | ---------+      +--------- | ---------+
> >            |                            |
> >            |          +-----+           |
> >            +----------| SAN |-----------+
> >                       +-----+
> >
> > Despite two independent instances of qemu-storage-daemon running
> > successfully on different hosts for the same qcow2 disk, I have
> > concerns about their proper functioning. Similar to live migration,
> > I think they should share state with each other.
> >
> > The question is how to make qemu-storage-daemon share this state
> > between multiple nodes, or is the qcow2 format inherently stateless
> > and does not require this?
> >
> > --
> > Best Regards,
> > Andrei Kvapil
>
> Hi Andrei,
>
> Apologies for not getting back to you sooner.
>
> Have you made progress on this?
>
> AIUI, and as others have mentioned, it's not possible to safely access
> a qcow2 file from more than one qemu-storage-daemon (qsd) instance at
> once. Disabling caching might help ensure consistency of the image's
> data, but there would still be no synchronization between the qsd
> instances when they are manipulating qcow2 metadata.
>
> ReadWriteMany block volumes are something that we would eventually
> like to support in Subprovisioner [1], for instance so that KubeVirt
> live migration can work with it. The best we have come up with is to
> export the volume from a single qsd instance over the network using
> NBD whenever more than one node has the volume mounted. This means
> that all but one node would be accessing the volume with degraded
> performance, but that may be acceptable for use cases like KubeVirt
> live migration. We would then somehow migrate the qsd instance from
> the source node to the destination node whenever the former unmounts
> it, so that the migrated VM can access the volume with full
> performance. This may require adding live migration support to qsd
> itself.
>
> What are your thoughts on this approach?
>
> Thanks,
> Alberto
>
> [1] https://gitlab.com/subprovisioner/subprovisioner
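P.S. For reference, the NBD variant mentioned above looked roughly like the sketch below. It is only an outline of what I tested, not the driver's actual code; the LV path, address, port, and export name are illustrative.

# One qsd instance owns the qcow2 and serves it over NBD; every other node
# attaches the export as a local block device over the network.

import subprocess

LV = "/dev/vg0/pvc-123"     # illustrative LV backing the qcow2
EXPORT = "pvc-123"
PORT = "10809"

def serve_volume():
    """On the node that owns the volume: open the qcow2 exactly once."""
    return subprocess.Popen([
        "qemu-storage-daemon",
        "--blockdev", f"driver=host_device,node-name=lun0,filename={LV},cache.direct=on",
        "--blockdev", "driver=qcow2,node-name=disk0,file=lun0",
        "--nbd-server", f"addr.type=inet,addr.host=0.0.0.0,addr.port={PORT}",
        "--export", f"type=nbd,id=exp0,node-name=disk0,name={EXPORT},writable=on",
    ])

def attach_volume(server_ip):
    """On any other node: attach the remote export via the kernel nbd driver."""
    # (nbd-client from the nbd package is another common way to do this)
    subprocess.run(
        ["qemu-nbd", "--connect=/dev/nbd0", f"nbd://{server_ip}:{PORT}/{EXPORT}"],
        check=True,
    )

Every node except the serving one then reads and writes through the user network, which is where the performance degradation we measured comes from.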