Hi Ming, unfortunately I do not have the raw metadata; it was not collected during the customer recovery. We do have discard_passdown enabled on the setups where the issue occurred.
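On the truncated feature flag in the `dmsetup table` output quoted below: here is a small sketch for checking a thin-pool table line. It assumes the field order from the kernel thin-provisioning documentation (start, length, `thin-pool`, metadata dev, data dev, data block size, low water mark, feature count, then the feature names). `ThinPoolTable` and `parse_thin_pool_table` are names made up for this illustration and are not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class ThinPoolTable:
    start: int
    length: int
    metadata_dev: str
    data_dev: str
    data_block_size: int
    low_water_mark: int
    declared_features: int  # the count printed before the feature names
    features: list

    @property
    def truncated(self):
        """True when fewer feature names are present than the count declares."""
        return len(self.features) < self.declared_features

    def discard_passdown(self):
        """True or False when it can be decided, None when the line is incomplete."""
        # ignore_discard turns discards off entirely, so nothing is passed down.
        if "no_discard_passdown" in self.features or "ignore_discard" in self.features:
            return False
        if self.truncated:
            return None  # a missing flag could be no_discard_passdown
        return True


def parse_thin_pool_table(line):
    """Parse one `dmsetup table` line for a thin-pool target (hypothetical helper)."""
    fields = line.split()
    if len(fields) < 7 or fields[2] != "thin-pool":
        raise ValueError("not a thin-pool table line")
    declared = int(fields[7]) if len(fields) > 7 else 0
    return ThinPoolTable(
        start=int(fields[0]),
        length=int(fields[1]),
        metadata_dev=fields[3],
        data_dev=fields[4],
        data_block_size=int(fields[5]),
        low_water_mark=int(fields[6]),
        declared_features=declared,
        features=fields[8:8 + declared],
    )


# The line exactly as it appears in the quoted thread (second flag missing).
pasted = "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing"
table = parse_thin_pool_table(pasted)
print(table.truncated, table.discard_passdown())
```

For the pasted line this reports two declared features with only one present, so the passdown setting cannot be read from that paste alone; the full line from the node would be needed to confirm it.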
Have you seen this problem before? On Tue, Apr 14, 2026 at 4:58 PM Ming Hung Tsai <[email protected]> wrote: > > Hi, > > I have two questions posted inline. > > On Tue, Apr 14, 2026 at 1:59 AM Lakshmi Narasimhan Sundararajan > <[email protected]> wrote: > > > > Good day! > > > > A gentle ping to hear an update from the authors here. More updates inline. > > > > On Sat, Apr 11, 2026 at 5:41 PM Lakshmi Narasimhan Sundararajan > > <[email protected]> wrote: > > > > > > Hi LVM Team! A very good day to you all. > > > [ I hope this email is the right one now] > > > > > > I recently experienced an outage where thin pool activation failed, > > > details are as follows. > > > Good news is, I was able to recover the pool through thin_repair. > > > Thank goodness! > > > > > > There was no infra induced failure i.e. no network, disk, usage over > > > limit, memory or compute being faulty orover used in any way. > > > Node was running healthy for 13 days and suddenly hit this issue. > > > Pool would handle I/O load (including discards), new volume > > > creation/deletion, and other regular activities. > > > > > > I tried to identify if there is a direct known issue, but I was unable to. > > > This generally seems to be some known issue, but I am unable to find a > > > direct link with the same signature. > > > > > > a) how to induce thin pool failures at will, so thin pool does not > > > activate, but repair succeeds, so I can test this recovery in some > > > controlled form. > > > > I have found a way, I think I can pull out the metadata xml and modify > > highest transaction and rewrite the metadata and swap the pool to > > recreate this condition. > > pool activation will fail and thin_repair can correct it. Any easier > > way that my suggestion, please feel free to suggest. > > Were you able to reproduce the issue on your end? I'm concerned that > using metadata rebuilt from XML might not trigger the bug because the > rebuilt layout differs from the original. 
> > Could you please provide the raw metadata image prior to any repairs? > This will allow us to investigate the issue further. > > > > > b) To your best knowledge this seems a known issue and fixed in a later > > > release? > > > I did my search at both kernel bugzilla and RHEL - and I am hoping you > > > can help me find it. Internet searches point to errata pages, but I am > > > unable to find the > > > exact ticket, commit that address this. The OCP platform was running a > > > recent release from RHEL. > > > Linux kernel: 5.14.0-427.109.1.el9_4 RHEL 9.4 This is likely 2 years old > > > though. > > > > So far I and my team have not been able to reproduce this issue, and > > look to your help confirming whether > > a) is this known already and is fixed! > > b) whats the safest kernel to upgrade to? > > c) still an open issue! > > > > > > > > c) After spending some time reviewing thin code and the commits since > > > the mentioned > > > kernel from kernel.org linux.. I suspect it could be a race with > > > discard and either IO or device creation/deletion on the same pool > > > could cause this? > > > Could the authors here, please confirm my code reading below. > > > ``` > > > > As mentioned we tried a focussed reproducer around this, but unable to > > trigger the issue. > > There are volume creation/deletions, snapshot creations and deletions, > > discards and regular IO at any point on the thin pool. > > And in addition, there would be calls to reserve/release the thin > > metadata to capture diff for backup between volumes. > > These are serialized at our app layer and I also see these are > > serialized within lvm layer too. > > > > Our volume deletions are 2 phased, based on our earlier discussion in > > this thread, we had observed high IO latency when volumes are deleted > > and the suggestion from this team was to discard and then delete > > volumes to keep the deletion time short. > > > > I hope this is giving enough context to understand this better. 
> > Unfortunately, since I am unable to reproduce this and have this > > sighted now twice at customer, I have no more datapoints to add. > > Would be willing to hear out if you have any suggestions I can pursue in > > house. > > > > Best regards > > > > > > > *** phase 1 - userspace issues blkdiscard on thin volumes *** > > > dm-thin.c : thin_bio_map() > > > → detects REQ_OP_DISCARD > > > → thin_defer_bio_with_throttle(tc, bio) > > > → adds bio to tc->deferred_bio_list // QUEUED, not processed > > > → wakes pool worker thread > > > > > > dm-thin.c : do_worker() // runs ASYNCHRONOUSLY > > > → process_deferred_bios() > > > → process_thin_deferred_bios() > > > → process_discard_bio() > > > → creates mapping, adds to pool->prepared_discards > > > → process_prepared(pool->prepared_discards) > > > → process_prepared_discard_no_passdown(m): > > > → dm_thin_remove_range(tc->td, begin, end) > > > [dm-thin-metadata.c] > > > → dm_btree_remove_leaves() > > > [dm-btree-remove.c] > > > → data_block_dec() // for each data block > > > [dm-thin-metadata.c] > > > → dm_sm_dec_blocks() // DECREMENTS refcount > > > [dm-space-map-common.c] > > > > > > *** phase 2: these steps still be IN PROGRESS or QUEUED when > > > userspace deletes the thin volume *** > > > > > > dm-thin.c : thin_dtr() // dmsetup remove > > > → list_del_rcu(&tc->list) // removes from > > > // pool->active_thins > > > → synchronize_rcu() > > > → dm_pool_close_thin_device(tc->td) // open_count-- > > > → kfree(tc) // tc FREED > > > > > > *** does NOT flush pool workqueue *** ← GAP 1 > > > *** does NOT drain prepared_discards *** ← GAP 2 > > > > > > dm-thin.c : process_delete_mesg() // dmsetup message > > > → dm_pool_delete_thin_device(pool->pmd, dev_id) > > > [dm-thin-metadata.c : __delete_device()] > > > → dm_btree_remove(&pmd->tl_info, ...) 
// remove from top-level > > > [dm-btree-remove.c] // btree > > > → subtree_dec() // cascades into: > > > [dm-thin-metadata.c] > > > → dm_btree_del() // walks ALL leaves > > > [dm-btree.c] > > > → data_block_dec() for EVERY remaining block > > > [dm-thin-metadata.c] > > > → dm_sm_dec_blocks() // DECREMENTS refcount > > > [dm-space-map-common.c] // for ALL blocks > > > > > > ** phase 3: KERNEL (worker thread — still running from Phase 1) *** > > > dm-thin.c : do_worker() // ASYNC, still running > > > → process_prepared(pool->prepared_discards) > > > → process_prepared_discard_no_passdown(m): > > > → m->tc points to FREED tc // ← use-after-free risk > > > → dm_thin_remove_range(tc->td, begin, end) > > > [dm-thin-metadata.c] > > > → dm_btree_remove_leaves() > > > [dm-btree-remove.c] > > > → data_block_dec() // SAME blocks already > > > [dm-thin-metadata.c] // decremented in > > > → dm_sm_dec_blocks() // Phase 2! > > > [dm-space-map-common.c] > > > > > > ┌──────────────────────────────────────────────────┐ > > > sm_ll_dec_bitmap(): > > > old = sm_lookup_bitmap(ic->bitmap, bit); > > > switch (old) { > > > case 0: // ← refcount ALREADY 0 > > > DMERR("unable to decrement block"); > > > return -EINVAL; // -22 > > > } > > > [dm-space-map-common.c] > > > └──────────────────────────────────────────────────┘ > > > > > > ▼ > > > dm_tm_shadow_block() fails (corrupted space map) > > > [dm-transaction-manager.c] > > > > > > ▼ > > > dm_pool_inc_data_range() fails with -EINVAL (-22) > > > [dm-thin-metadata.c] > > > > > > ▼ > > > metadata_operation_failed(pool, "dm_pool_inc_data_range") > > > [dm-thin.c] > > > > > > ▼ > > > set_pool_mode(pool, PM_READ_ONLY) > > > [dm-thin.c] > > > > > > *** POOL IS NOW DEAD *** > > > ``` > > > > > > > > > > > > As always, many thanks for your help. 
> > > > > > > > > # issue unable to activate thin pool > > > ``` > > > [Wed Apr 8 17:05:14 2026] device-mapper: space map common: unable to > > > decrement block > > > [Wed Apr 8 17:08:11 2026] device-mapper: space map common: unable to > > > decrement block > > > [Wed Apr 8 17:08:11 2026] device-mapper: space map common: > > > dm_tm_shadow_block() failed > > > [Wed Apr 8 17:08:11 2026] device-mapper: space map common: unable to > > > decrement block > > > [Wed Apr 8 17:08:11 2026] device-mapper: space map common: > > > dm_tm_shadow_block() failed > > > [Wed Apr 8 17:08:11 2026] device-mapper: space map common: unable to > > > decrement block > > > [Wed Apr 8 17:08:11 2026] device-mapper: space map common: > > > dm_tm_shadow_block() failed > > > [Wed Apr 8 17:08:31 2026] device-mapper: space map common: unable to > > > decrement block > > > [Wed Apr 8 17:08:31 2026] device-mapper: space map common: > > > dm_tm_shadow_block() failed > > > [Wed Apr 8 17:08:31 2026] device-mapper: space map common: unable to > > > decrement block > > > [Wed Apr 8 17:08:31 2026] device-mapper: space map common: > > > dm_tm_shadow_block() failed > > > ``` > > > > > > # host and lvm tools version > > > ``` > > > uname -a > > > Linux kernel: 5.14.0-427.109.1.el9_4 > > > RHEL 9.4 > > > > > > lvm version > > > 2.03.23(2) (2023-11-21) > > > library: 1.02.197 (2023-11-21) > > > driver: 4.48.1 > > > ``` > > > > > > Below are references to the node block layer. > > > There was IO, thin volume creations and deletions, IO includes discards > > > too. > > > ``` > > > [root@root core]# lvs -a pwx1 > > > Please remove the lvm.conf global_filter, it is ignored with the > > > devices file. 
> > > LV VG Attr LSize Pool Origin > > > Data% Meta% Move Log Cpy%Sync Convert > > > 1004123733318649769 pwx1 Vwi-a-t--- 50.00g pxpool 660563940592999863 > > > 0.25 > > > 103699400925372609 pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 > > > 59.75 > > > 1072608604746349133 pwx1 Vwi-a-t--- 50.00g pxpool 941788757364603035 > > > 0.25 > > > 1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool > > > 59.75 > > > 1138695541641144166 pwx1 Vwi-a-t--- 50.00g pxpool 941788757364603035 > > > 0.25 > > > 136169780918964477 pwx1 Vwi-aot--- 30.00g pxpool > > > 33.33 > > > 218651423266852202 pwx1 Vwi-aot--- 5.00g pxpool > > > 3.49 > > > 404947242154831849 pwx1 Vwi-aot--- 5.00g pxpool > > > 4.20 > > > 440731835552948333 pwx1 Vwi-aot--- 50.00g pxpool > > > 5.59 > > > 462681831690737818 pwx1 Vwi-a-t--- 50.00g pxpool 73089959772282964 > > > 0.25 > > > 519898065353250833 pwx1 Vwi-a-t--- 50.00g pxpool 660563940592999863 > > > 0.25 > > > 527922274169222783 pwx1 Vwi-aot--- 200.00g pxpool > > > 28.64 > > > 537994915504805835 pwx1 Vwi-aot--- 50.00g pxpool > > > 10.88 > > > 569690966828279529 pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 > > > 59.75 > > > 594992999737145586 pwx1 Vwi-aot--- 200.00g pxpool > > > 28.91 > > > 660563940592999863 pwx1 Vwi-aot--- 50.00g pxpool > > > 0.25 > > > 702358223003836192 pwx1 Vwi-aot--- 200.00g pxpool > > > 28.64 > > > 73089959772282964 pwx1 Vwi-aot--- 50.00g pxpool > > > 0.25 > > > 793515512579595979 pwx1 Vwi-aot--- 30.00g pxpool > > > 33.33 > > > 79731196567060146 pwx1 Vwi-aot--- 50.00g pxpool > > > 10.90 > > > 865397616123963982 pwx1 Vwi-aot--- 50.00g pxpool > > > 9.39 > > > 866802183893693297 pwx1 Vwi-aot--- 200.00g pxpool > > > 28.91 > > > 941788757364603035 pwx1 Vwi-aot--- 50.00g pxpool > > > 0.25 > > > 960350716126095496 pwx1 Vwi-a-t--- 50.00g pxpool 73089959772282964 > > > 0.25 > > > [lvol0_pmspare] pwx1 ewi------- 2.00g > > > pxMetaFS pwx1 Vwi-aot--- 64.00g pxpool > > > 0.05 > > > pxpool pwx1 twi-aot--- 1.54t > > > 43.59 5.06 <<< very low 
tmeta util. > > > [pxpool_tdata] pwx1 Twi-ao---- 1.54t > > > [pxpool_tmeta] pwx1 ewi-ao---- 4.00g > > > pxreserve pwx1 -wi------k 15.00g > > > [root@root core]# > > > [root@root core]# vgs pwx1 > > > Please remove the lvm.conf global_filter, it is ignored with the > > > devices file. > > > VG #PV #LV #SN Attr VSize VFree > > > pwx1 1 27 0 wz--n- 1.56t 0 > > > [root@root core]# lsblk -s /dev/pwx1/1004123733318649769 > > > NAME MAJ:MIN RM SIZE RO TYPE > > > MOUNTPOINTS > > > pwx1-1004123733318649769 253:107 0 50G 0 lvm > > > └─pwx1-pxpool-tpool 253:14 0 1.5T 0 lvm > > > ├─pwx1-pxpool_tmeta 253:12 0 4G 0 lvm > > > │ └─md126 9:126 0 1.6T 0 raid0 > > > │ └─eui.00806e28521374ac24a93718000982be 253:10 0 1.6T 0 mpath > > > │ ├─nvme4n2 259:5 0 1.6T 0 disk > > > │ ├─nvme5n2 259:8 0 1.6T 0 disk > > > │ └─nvme6n2 259:11 0 1.6T 0 disk > > > └─pwx1-pxpool_tdata 253:13 0 1.5T 0 lvm > > > └─md126 9:126 0 1.6T 0 raid0 > > > └─eui.00806e28521374ac24a93718000982be 253:10 0 1.6T 0 mpath > > > ├─nvme4n2 259:5 0 1.6T 0 disk > > > ├─nvme5n2 259:8 0 1.6T 0 disk > > > └─nvme6n2 259:11 0 1.6T 0 disk > > > [root@root core]# ls -al /dev/md/pwx1 > > > lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126 > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool > > > 0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing > > The second feature flag was truncated. Does the thin-pool has > no_discard_passdown enabled? 
> > > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata > > > 0 1008459776 linear 9:126 35653632 > > > 1008459776 629145600 linear 9:126 1048307712 > > > 1637605376 1673527296 linear 9:126 1681647616 > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta > > > 0 4194304 linear 9:126 1044113408 > > > 4194304 4194304 linear 9:126 1677453312 > > > [root@root core]# > > > [root@root core]# dmsetup table --target multipath > > > 3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 > > > 1 1 > > > 3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 > > > 1 1 > > > 3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 > > > 1 1 > > > 3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 > > > 1 1 > > > 3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 > > > 1 1 > > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1 > > > service-time 0 1 2 259:2 1 1 > > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1 > > > service-time 0 1 2 259:0 1 1 > > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1 > > > service-time 0 1 2 259:3 1 1 > > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1 > > > service-time 0 1 2 259:1 1 1 > > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3 > > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 > > > 259:4 1 259:7 1 259:10 1 > > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3 > > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 > > > 259:5 1 259:8 1 259:11 1 > > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3 > > > retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 > > > 259:6 1 259:9 1 259:12 1 > > > [root@root core]# dmsetup status --target multipath > > > 3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1 > 
> > 3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1 > > > 3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1 > > > 3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1 > > > 3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1 > > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1 > > > 1 A 0 1 2 259:2 A 0 0 1 > > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1 > > > 1 A 0 1 2 259:0 A 0 0 1 > > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1 > > > 1 A 0 1 2 259:3 A 0 0 1 > > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1 > > > 1 A 0 1 2 259:1 A 0 0 1 > > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0 > > > 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22 > > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1 > > > 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11 > > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1 > > > 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0 > > > [root@root core]# > > > [root@root core]# mdadm -D /dev/md/pwx1 > > > /dev/md/pwx1: > > > Version : 1.2 > > > Creation Time : Tue Mar 17 15:29:31 2026 > > > Raid Level : raid0 > > > Array Size : 1677589504 (1599.87 GiB 1717.85 GB) > > > Raid Devices : 1 > > > Total Devices : 1 > > > Persistence : Superblock is persistent > > > > > > Update Time : Mon Mar 23 20:52:51 2026 > > > State : clean > > > Active Devices : 1 > > > Working Devices : 1 > > > Failed Devices : 0 > > > Spare Devices : 0 > > > > > > Chunk Size : 1024K > > > > > > Consistency Policy : none > > > > > > Name : any:pwx1 > > > UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021 > > > Events : 16 > > > > > > Number Major Minor RaidDevice State > > > 0 253 10 0 active sync /dev/dm-10 > > > [root@root core]# > > > ``` > > > >
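To make the ordering I suspect in the quoted trace easier to discuss, here is a toy model of it. This is only a sketch of the hypothesis, not kernel code: `ToySpaceMap`, `discard`, `delete_device`, `stale_discard` and `simulate` are invented names, the block numbers are arbitrary, and nothing here shows that dm-thin actually behaves this way.

```python
class ToySpaceMap:
    """Per-block reference counts; a stand-in for the real space map."""

    def __init__(self):
        self.refcounts = {}

    def inc(self, block):
        self.refcounts[block] = self.refcounts.get(block, 0) + 1

    def dec(self, block):
        if self.refcounts.get(block, 0) == 0:
            # Same wording as the kernel log message in the thread.
            raise ValueError(f"unable to decrement block {block}")
        self.refcounts[block] -= 1


def discard(sm, mapping, blocks):
    """Unmap blocks that are still mapped and drop one reference for each."""
    for block in blocks:
        if block in mapping:
            mapping.remove(block)
            sm.dec(block)


def delete_device(sm, mapping):
    """Drop one reference for every block the device still maps."""
    for block in sorted(mapping):
        sm.dec(block)
    mapping.clear()


def stale_discard(sm, blocks):
    """Hypothesised faulty path: replay a discard range captured before the
    device was deleted, without consulting the mapping that no longer exists."""
    for block in blocks:
        sm.dec(block)


def simulate(drain_before_delete):
    sm = ToySpaceMap()
    mapping = {10, 11, 12}      # arbitrary data blocks owned by one thin device
    for block in mapping:
        sm.inc(block)
    queued = [10, 11]           # discard range queued in phase 1

    if drain_before_delete:
        discard(sm, mapping, queued)
        delete_device(sm, mapping)
        return "ok"

    delete_device(sm, mapping)  # phase 2 runs first
    try:
        stale_discard(sm, queued)  # phase 3 decrements the same blocks again
    except ValueError as err:
        return str(err)
    return "ok"


print(simulate(drain_before_delete=True))
print(simulate(drain_before_delete=False))
```

In this model the run that drains the queued discard before the delete finishes cleanly, while the run that deletes first hits a second decrement on a block whose count is already zero. Whether the real code can reach that ordering is exactly what I am hoping you can confirm or rule out.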
