Hey Alberto,

My article on this design has just been published. In it, I discuss the
chosen technologies and the ReadWriteMany implementation:

https://blog.deckhouse.io/lvm-qcow-csi-driver-shared-san-kubernetes-81455201590e

To get ahead of the article a little: I have tested the method of
exporting the volume from a single QSD instance to another node over the
user network using NBD, and I ran into significant performance issues.
I would also note that this approach is overkill when the data is already
accessible on a backing block device via the SAN.

We have opted for an approach that switches cache.direct during live
migration of a virtual machine, on the assumption that this is not
full-fledged ReadWriteMany and will be used solely for the live migration
of virtual machines.
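
For context, here is a sketch of what that switch could look like at the
QMP level, assuming QSD is started with a monitor socket (e.g.
--chardev socket,id=mon,path=/run/qsd.sock,server=on,wait=off plus
--monitor chardev=mon). The node name and device path below are
placeholders, not our actual implementation:

    { "execute": "qmp_capabilities" }
    { "execute": "blockdev-reopen",
      "arguments": { "options": [ {
        "driver": "file", "node-name": "lun0",
        "filename": "/dev/shared-vg/vol0",
        "cache": { "direct": true } } ] } }

The same blockdev-reopen with "direct": false flips the node back once
the migration has finished.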

Best Regards,
Andrei Kvapil


On Wed, Aug 16, 2023 at 11:31 AM Alberto Faria <afa...@redhat.com> wrote:

> On Mon, Jun 19, 2023 at 6:29 PM kvaps <kva...@gmail.com> wrote:
> > Hi Kevin and the community,
> >
> > I am designing a CSI driver for Kubernetes that allows efficient
> > utilization of SAN (Storage Area Network) and supports thin
> > provisioning, snapshots, and ReadWriteMany mode for block devices.
> >
> > To implement this, I have explored several technologies such as
> > traditional LVM, LVMThin (which does not support shared mode), and
> > QCOW2 on top of block devices. This is the same approach that oVirt
> > uses for thin provisioning over a shared LUN:
> >
> > https://github.com/oVirt/vdsm/blob/08a656c/doc/thin-provisioning.md
> >
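> > For illustration, the basic layering could be reproduced with
> > something like the following (volume group, names, and sizes are
> > arbitrary):
> >
> >     # Small LV on the shared VG, to be extended on demand:
> >     lvcreate -n vol0 -L 1G shared-vg
> >     # Format the LV itself as a qcow2 image with a larger virtual size:
> >     qemu-img create -f qcow2 /dev/shared-vg/vol0 100G
> >     # As qcow2 allocation approaches the LV size, grow the LV:
> >     lvextend -L +1G /dev/shared-vg/vol0
> >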
> > Based on benchmark results, I found that when creating snapshots the
> > performance degradation of block-backed QCOW2 is much lower than that
> > of LVM and LVMThin.
> >
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=2020746352
> >
> > Therefore, I have decided to use the same approach for Kubernetes.
> >
> > But in Kubernetes, the storage system needs to be self-sufficient and
> > not dependent on the workload that uses it. Thus, unlike oVirt, we
> > have no option to use the libvirt interface of the running VM to
> > invoke live migration. Instead, we should provide a pure block device
> > in ReadWriteMany mode, where the block device can be writable on
> > multiple hosts simultaneously.
> >
> > To achieve this, I decided to use the qemu-storage-daemon with the
> > VDUSE backend.
> >
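> > A minimal invocation on one node might look roughly like this (the
> > device path and names are placeholders):
> >
> >     # (requires the vduse kernel module: modprobe vduse)
> >     qemu-storage-daemon \
> >       --blockdev driver=file,node-name=lun0,filename=/dev/shared-vg/vol0,cache.direct=on \
> >       --blockdev driver=qcow2,node-name=vol0,file=lun0 \
> >       --export type=vduse-blk,id=exp0,node-name=vol0,name=vol0,writable=on
> >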
> > Other technologies, such as NBD and UBLK, were also considered, and
> > their benchmark results can be seen in the same document on a
> > different sheet:
> >
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=416958126
> >
> > Taking into account the performance, stability, and versatility, I
> > concluded that VDUSE is the optimal choice. To connect the device in
> > Kubernetes, the virtio-vdpa interface would be used, and the entire
> > scheme could look like this:
> >
> >
> > +---------------------+  +---------------------+
> > | node1               |  | node2               |
> > |                     |  |                     |
> > |    +-----------+    |  |    +-----------+    |
> > |    | /dev/vda  |    |  |    | /dev/vda  |    |
> > |    +-----+-----+    |  |    +-----+-----+    |
> > |          |          |  |          |          |
> > |     virtio-vdpa     |  |     virtio-vdpa     |
> > |          |          |  |          |          |
> > |        vduse        |  |        vduse        |
> > |          |          |  |          |          |
> > | qemu-storage-daemon |  | qemu-storage-daemon |
> > |          |          |  |          |          |
> > | +------- | -------+ |  | +------- | -------+ |
> > | | LUN    |        | |  | | LUN    |        | |
> > | |  +-----+-----+  | |  | |  +-----+-----+  | |
> > | |  | LV (qcow2)|  | |  | |  | LV (qcow2)|  | |
> > | |  +-----------+  | |  | |  +-----------+  | |
> > | +--------+--------+ |  | +--------+--------+ |
> > |          |          |  |          |          |
> > |          |          |  |          |          |
> > +--------- | ---------+  +--------- | ---------+
> >            |                        |
> >            |         +-----+        |
> >            +---------| SAN |--------+
> >                      +-----+
> >
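> > On each node, the VDUSE device created by the vduse-blk export would
> > then be attached through the vdpa bus to obtain the /dev/vda shown in
> > the diagram, roughly as follows (names as in the sketch above):
> >
> >     modprobe virtio-vdpa
> >     # "vol0" must match the name= given to the vduse-blk export:
> >     vdpa dev add name vol0 mgmtdev vduse
> >     # the volume now shows up as a regular virtio-blk disk, e.g. /dev/vda
> >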
> > Although two independent instances of qemu-storage-daemon for the
> > same qcow2 disk run successfully on different hosts, I have concerns
> > about whether they function properly. Similar to live migration, I
> > think they should share their state with each other.
> >
> > The question is how to make qemu-storage-daemon share this state
> > between multiple nodes, or is the qcow2 format inherently stateless,
> > so that this is not required?
> >
> > --
> > Best Regards,
> > Andrei Kvapil
>
> Hi Andrei,
>
> Apologies for not getting back to you sooner.
>
> Have you made progress on this?
>
> AIUI, and as others have mentioned, it's not possible to safely access
> a qcow2 file from more than one qemu-storage-daemon (qsd) instance at
> once. Disabling caching might help ensure consistency of the image's
> data, but there would still be no synchronization between the qsd
> instances when they are manipulating qcow2 metadata.
>
> ReadWriteMany block volumes are something that we would eventually
> like to support in Subprovisioner [1], for instance so KubeVirt live
> migration can work with it. The best we have come up with is to export
> the volume from a single qsd instance over the network using NBD,
> whenever more than one node has the volume mounted. This means that
> all but one node would be accessing the volume with degraded
> performance, but that may be acceptable for use cases like KubeVirt
> live migration. We would then somehow migrate the qsd instance from
> the source node to the destination node whenever the former unmounts
> it, so that the migrated VM can access the volume with full
> performance. This may require adding live migration support to qsd
> itself.
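>
> For illustration only, that NBD path could be sketched roughly like
> this (addresses, node names, and export names are made up; this is just
> the idea, not necessarily how Subprovisioner would implement it):
>
>     # On the node that currently owns the volume, serve the qcow2 over NBD:
>     qemu-storage-daemon \
>       --blockdev driver=file,node-name=lun0,filename=/dev/shared-vg/vol0,cache.direct=on \
>       --blockdev driver=qcow2,node-name=vol0,file=lun0 \
>       --nbd-server addr.type=inet,addr.host=0.0.0.0,addr.port=10809 \
>       --export type=nbd,id=exp0,node-name=vol0,name=vol0,writable=on
>
>     # On any other node, attach the same volume through NBD instead of
>     # opening the qcow2 directly:
>     qemu-storage-daemon \
>       --blockdev driver=nbd,node-name=remote0,server.type=inet,server.host=node1,server.port=10809,export=vol0 \
>       --export type=vduse-blk,id=exp1,node-name=remote0,name=vol0,writable=on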
>
> What are your thoughts on this approach?
>
> Thanks,
> Alberto
>
> [1] https://gitlab.com/subprovisioner/subprovisioner
>
>
