Hi, I am seeing occasional, hard-to-reproduce failures when running lvcreate for a thin LV, with transaction ID mismatch errors.
The system is a self-managed compute node that uses thin LVs for base system software and containers: each service has a separate thin LV as its rootfs, and the system takes a fresh thin snapshot of the installed contents at every boot.

During system bring-up we have two concurrent processes adding, deleting, and renaming thin LVs in a single thin pool:

1 - A login script that creates a thin snapshot of a minimal rootfs for each user, then launches an LXC container with that rootfs and leaves the user in a bash shell running in that container. If anything fails along the way, it will lvremove the snapshot and retry several times. Although it creates a container, the script itself runs as root via sudo, not inside any container or namespace.

2 - A software install service that takes system services packaged as containers, creates thin LVs based on the container image layer set, and then takes an additional thin snapshot to be mounted for the current boot. This last snapshot is multi-step, with an initial lvcreate under a temporary name and a final lvrename.

Neither of these processes runs in a container or on a VM with a shared volume, so they should be seeing the same LVM lock files, as far as I can tell.
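To make the sequence in item 2 concrete, here is a minimal Python sketch of the two-step snapshot with cleanup and retry. It is an illustration only, not the actual service code: the lvcreate flags mirror the failing command quoted further down, while the function names, the retry count, and the exact lvrename/lvremove argument layout are assumptions made for this sketch.

```python
import subprocess


def snapshot_commands(vg, origin, tmp_name, final_name):
    """Build the argv lists for the two steps: lvcreate under a temporary
    name, then lvrename to the final name. Nothing is executed here."""
    create = [
        "lvm", "lvcreate",
        "--activate=y", "--setactivationskip=y", "--ignoreactivationskip",
        f"--name={tmp_name}",
        "--snapshot", f"{vg}/{origin}",
    ]
    rename = ["lvm", "lvrename", vg, tmp_name, final_name]
    return [create, rename]


def cleanup_command(vg, name):
    """Build the argv list that removes a leftover temporary snapshot."""
    return ["lvm", "lvremove", "--yes", f"{vg}/{name}"]


def run_with_retry(vg, origin, tmp_name, final_name,
                   attempts=3, runner=subprocess.run):
    """Run both steps in order; on any non-zero exit, try to remove the
    temporary LV and start over. Returns True once both steps succeed.
    `runner` is injectable so the logic can be exercised without LVM."""
    for _ in range(attempts):
        succeeded = True
        for argv in snapshot_commands(vg, origin, tmp_name, final_name):
            if runner(argv).returncode != 0:
                succeeded = False
                break
        if succeeded:
            return True
        # Best-effort cleanup; its exit status is deliberately ignored.
        runner(cleanup_command(vg, tmp_name))
    return False
```

The login script follows the same create, clean up, retry shape against the same thin pool, which is how the two processes end up issuing LVM commands concurrently.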
This overall approach has been stable for a long time, but a recent change has caused these two processes to overlap more frequently, and we are now seeing failures in lvcreate with a transaction ID mismatch when the install service tries to create its temporary LV. Here's a snippet from one such log:

  Error: lvm lvcreate --activate=y --setactivationskip=y --ignoreactivationskip --name=tmp-extract-414e5ec83c02133eae2984ee4 25b22589bca058d --snapshot vg_ifc0/5e9a280e11efbc75ed8f01bdd7b58559c373b451b72921555be5c2eaf93d27b2: exit status 5:
  /dev/sdh: open failed: No medium found
  /dev/sdi: open failed: No medium found
  /dev/sdj:ound
  /dev/sdk: open failed: No medium found
  /dev/sdh: open failed: No medium found
  /dev/sdi: open failed: No medium found
  /dev/sdj: open failed: No medium found
  /dev/sdk: open failed: No medium fou
  ThinDataLV-tpool (251:3) transaction_id is 147, while expected 148.
  Failed to suspend vg_ifc0/ThinDataLV with queued messages.

Due to some failure recovery loops, these services are running lvcreate/lvremove/lvrename (on the same VG but different LVs) as often as 5 times per second, which seems fast but doesn't seem like it should be a problem.

Looking through past messages to this list, it looks like previous cases were due to sharing volumes between containers/VMs without a common lock dir, which we are not doing.

Any thoughts on how to further debug or avoid this issue? I can provide the LVM metadata backup files if that would help. There are a lot of them because, once it starts failing, the system retries frequently.

This is on Ubuntu 20.04, with lvm 2.03.07(2) (Ubuntu package version 2.03.07-1ubuntu1) and a custom kernel built from 5.15.68.

Thanks!
-mike
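P.S. Since the retries generate a lot of log output, a small helper like the one below could be used to pull every mismatch line out of the logs and see how the reported and expected IDs move over time. This is only a sketch: the regular expression is derived from the single message quoted above, and the function and field names are made up for illustration.

```python
import re

# Pattern derived from the one log line quoted above, e.g.
# "ThinDataLV-tpool (251:3) transaction_id is 147, while expected 148."
MISMATCH_RE = re.compile(
    r"(?P<pool>\S+) \((?P<major>\d+):(?P<minor>\d+)\) "
    r"transaction_id is (?P<reported>\d+), while expected (?P<expected>\d+)"
)


def find_mismatches(log_text):
    """Return one dict per transaction_id mismatch message in log_text."""
    results = []
    for m in MISMATCH_RE.finditer(log_text):
        reported = int(m.group("reported"))
        expected = int(m.group("expected"))
        results.append({
            "pool": m.group("pool"),
            "device": f"{m.group('major')}:{m.group('minor')}",
            "reported": reported,
            "expected": expected,
            # How far the expected ID is ahead of the reported one.
            "delta": expected - reported,
        })
    return results
```

Run over the concatenated logs, this gives one record per failure, which should make it easier to tell whether the gap stays at 1 or keeps growing across retries.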
_______________________________________________ linux-lvm mailing list linux-lvm@redhat.com https://listman.redhat.com/mailman/listinfo/linux-lvm read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/