Hello,

As I got no answer here, could someone advise me of a better place to look for hints about this QEMU qcow2 corruption issue?

Thank you

--
Nicolas Ecarnot

On 07/02/2018 at 18:06, Nicolas Ecarnot wrote:
Hello,

TL;DR: qcow2 images keep getting corrupted. Any workaround?

Long version:
I have already started this discussion on the oVirt and qemu-block mailing lists under similar circumstances, but I have learned more in the months since. Here is some information:

- We are using 2 oVirt 3.6.7.5-1.el7.centos datacenters, using CentOS 7.{2,3} hosts
- Hosts :
   - CentOS 7.2 1511 :
     - Kernel = 3.10.0 327
     - KVM : 2.3.0-31
     - libvirt : 1.2.17
     - vdsm : 4.17.32-1
   - CentOS 7.3 1611 :
     - Kernel 3.10.0 514
     - KVM : 2.3.0-31
     - libvirt 2.0.0-10
     - vdsm : 4.17.32-1
- Our storage is 2 Equallogic SANs connected via iSCSI on a dedicated network.
- It varies from week to week, but all in all there are around 32 hosts, 8 storage domains and, for various reasons, very few VMs (fewer than 200).
- One peculiar point is that most of our VMs are given an additional dedicated network interface that is iSCSI-connected to some volumes of our SAN, these volumes not being part of the oVirt setup. That can lead to a lot of additional iSCSI traffic.

From time to time, a random VM appears paused by oVirt.
Digging into the oVirt engine logs, then into the host vdsm logs, it appears that the host considers the qcow2 image corrupted. In what I consider conservative behavior, vdsm stops any interaction with the image and marks the VM as paused.
Any attempt to unpause it leads to the same conservative pause.

After finding (https://access.redhat.com/solutions/1173623) the right logical volume hosting the qcow2 image, I can run qemu-img check on it:
- On 80% of my VMs, I find no errors.
- On 15% of them, I find leaked-cluster errors that I can correct using "qemu-img check -r all".
- On 5% of them, I find leaked-cluster errors plus further fatal errors that cannot be corrected with qemu-img. In rare cases, qemu-img can correct them but destroys large parts of the image (it becomes unusable); in other cases, it cannot correct them at all.
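For what it's worth, the triage above can be sketched as a small script around qemu-img check. The LV path /dev/myvg/myimage is a placeholder, not from my setup; the exit codes are the ones documented in the qemu-img man page (0 = no errors, 1 = check could not complete, 2 = corruption found, 3 = leaked clusters but not corrupted):

```shell
#!/bin/sh
# Sketch only: map a qemu-img check exit code to the action described above.
classify_check() {
  case "$1" in
    0) echo "clean" ;;
    3) echo "leaked-clusters: repairable with 'qemu-img check -r all'" ;;
    2) echo "corrupt: do not auto-repair, keep the image for debugging" ;;
    *) echo "check-failed" ;;
  esac
}

# /dev/myvg/myimage is a hypothetical LV path; guard so the sketch is
# harmless on machines without qemu-img or that LV.
if command -v qemu-img >/dev/null 2>&1 && [ -e /dev/myvg/myimage ]; then
  qemu-img check /dev/myvg/myimage
  classify_check "$?"
fi
```

The point of separating exit code 3 from 2 is exactly the 15%/5% split above: leaked clusters alone are safely repairable, while real corruption should be preserved for analysis rather than "repaired" into an unusable image.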

Months ago, I already sent a similar message, but the error back then was "No space left on device" (https://www.mail-archive.com/qemu-block@gnu.org/msg00110.html).

This time, there is no message about space, only corruption.

I kept reading and found a similar discussion in the Proxmox forum:
https://lists.ovirt.org/pipermail/users/2018-February/086750.html

https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/page-2

What I read there that is similar to my case:
- usage of qcow2
- heavy disk I/O
- using the virtio-blk driver

In the Proxmox thread, they tend to say that switching to virtio-scsi is the solution. I asked oVirt experts about this (https://lists.ovirt.org/pipermail/users/2018-February/086753.html), but it's not clear the driver is to blame.

I agree with the answer Yaniv Kaul gave me, saying I have to report the issue properly, so I'm eager to know which specific information I can provide you now.

As you can imagine, all of this setup is in production, and for most of the VMs I cannot "play" with them. We have, however, launched a campaign of nightly stopping every VM, running qemu-img check on each one, then booting it again.
So it might take some time before I find another corrupted image
(which I'll carefully preserve for debugging).
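As an aside, for such a nightly campaign, a very cheap pre-check before a full qemu-img check is to validate the qcow2 header. A minimal sketch, relying only on the on-disk layout from the qcow2 specification (4 magic bytes 'Q','F','I',0xfb, then a big-endian 32-bit version that must be 2 or 3); the file names here are synthetic, not from my setup:

```python
import os
import struct
import tempfile

QCOW2_MAGIC = 0x514649FB  # the bytes 'Q', 'F', 'I', 0xfb

def qcow2_header_ok(path):
    """Cheap sanity check: validate the qcow2 magic and version fields.

    This only catches gross header damage; it is no substitute for a
    full 'qemu-img check' of the refcount and L1/L2 tables.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8:
        return False
    magic, version = struct.unpack(">II", head)
    return magic == QCOW2_MAGIC and version in (2, 3)

# Demo with a synthetic 8-byte header (not a real image).
fd, path = tempfile.mkstemp()
os.write(fd, struct.pack(">II", QCOW2_MAGIC, 3))
os.close(fd)
print(qcow2_header_ok(path))  # → True
os.remove(path)
```

Such a check can be run against a live LV without pausing the VM, which makes it usable as a coarse early-warning filter between the nightly full checks.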

Other information: we very rarely take snapshots, but I'm inclined to suspect that automated migrations of VMs could trigger similar behavior on qcow2 images.

Last point about the versions we use: yes, they're old; yes, we're planning to upgrade, but we don't know when.

Regards,



--
Nicolas ECARNOT
