Hi LVM Team! A very good day to you all.
[ I hope this email is the right one now]
I recently experienced an outage in which thin pool activation failed;
details follow.
The good news is that I was able to recover the pool with thin_repair.
Thank goodness!
There was no infrastructure-induced failure: no network or disk fault,
no usage over limits, and no faulty or overused memory or compute.
The node had been running healthy for 13 days when it suddenly hit this
issue. Until then the pool was handling I/O load (including discards),
volume creation/deletion, and other regular activity.
This looks like it may be a known issue, but I have been unable to find
a report with the same signature.
I have three questions:
a) How can I induce a thin pool failure at will, such that the pool
fails to activate but repair succeeds, so that I can test this recovery
in a controlled way?
b) To the best of your knowledge, is this a known issue that has been
fixed in a later release?
I searched both the kernel bugzilla and RHEL, and I am hoping you
can help me find it. Internet searches point to errata pages, but I am
unable to find the exact ticket or commit that addresses this. The OCP
platform was running a recent release from RHEL.
Linux kernel: 5.14.0-427.109.1.el9_4 (RHEL 9.4). This is likely about
2 years old, though.
c) After spending some time reviewing the thin code and the commits
made since the above kernel in the kernel.org Linux tree, I suspect a
race between discard and either I/O or device creation/deletion on the
same pool. Could that cause this?
Could the authors here please confirm my code reading below?
```
*** phase 1 - userspace issues blkdiscard on thin volumes ***
dm-thin.c : thin_bio_map()
  → detects REQ_OP_DISCARD
  → thin_defer_bio_with_throttle(tc, bio)
    → adds bio to tc->deferred_bio_list   // QUEUED, not processed
    → wakes pool worker thread
dm-thin.c : do_worker()                   // runs ASYNCHRONOUSLY
  → process_deferred_bios()
    → process_thin_deferred_bios()
      → process_discard_bio()
        → creates mapping, adds to pool->prepared_discards
  → process_prepared(pool->prepared_discards)
    → process_prepared_discard_no_passdown(m):
      → dm_thin_remove_range(tc->td, begin, end)
        [dm-thin-metadata.c]
        → dm_btree_remove_leaves()
          [dm-btree-remove.c]
          → data_block_dec()              // for each data block
            [dm-thin-metadata.c]
            → dm_sm_dec_blocks()          // DECREMENTS refcount
              [dm-space-map-common.c]
*** phase 2: the phase 1 steps may still be IN PROGRESS or QUEUED when
    userspace deletes the thin volume ***
dm-thin.c : thin_dtr()                    // dmsetup remove
  → list_del_rcu(&tc->list)               // removes from pool->active_thins
  → synchronize_rcu()
  → dm_pool_close_thin_device(tc->td)     // open_count--
  → kfree(tc)                             // tc FREED
  *** does NOT flush pool workqueue ***     ← GAP 1
  *** does NOT drain prepared_discards ***  ← GAP 2
dm-thin.c : process_delete_mesg()         // dmsetup message
  → dm_pool_delete_thin_device(pool->pmd, dev_id)
    [dm-thin-metadata.c : __delete_device()]
    → dm_btree_remove(&pmd->tl_info, ...) // remove from top-level btree
      [dm-btree-remove.c]
      → subtree_dec()                     // cascades into:
        [dm-thin-metadata.c]
        → dm_btree_del()                  // walks ALL leaves
          [dm-btree.c]
          → data_block_dec() for EVERY remaining block
            [dm-thin-metadata.c]
            → dm_sm_dec_blocks()          // DECREMENTS refcount for ALL blocks
              [dm-space-map-common.c]
*** phase 3: KERNEL (worker thread, still running from phase 1) ***
dm-thin.c : do_worker()                   // ASYNC, still running
  → process_prepared(pool->prepared_discards)
    → process_prepared_discard_no_passdown(m):
      → m->tc points to FREED tc          // ← use-after-free risk
      → dm_thin_remove_range(tc->td, begin, end)
        [dm-thin-metadata.c]
        → dm_btree_remove_leaves()
          [dm-btree-remove.c]
          → data_block_dec()              // SAME blocks already decremented in phase 2!
            [dm-thin-metadata.c]
            → dm_sm_dec_blocks()
              [dm-space-map-common.c]
┌──────────────────────────────────────────────────┐
  sm_ll_dec_bitmap():
    old = sm_lookup_bitmap(ic->bitmap, bit);
    switch (old) {
    case 0:                       // ← refcount ALREADY 0
        DMERR("unable to decrement block");
        return -EINVAL;           // -22
    }
  [dm-space-map-common.c]
└──────────────────────────────────────────────────┘
    ▼
dm_tm_shadow_block() fails (corrupted space map)
  [dm-transaction-manager.c]
    ▼
dm_pool_inc_data_range() fails with -EINVAL (-22)
  [dm-thin-metadata.c]
    ▼
metadata_operation_failed(pool, "dm_pool_inc_data_range")
  [dm-thin.c]
    ▼
set_pool_mode(pool, PM_READ_ONLY)
  [dm-thin.c]
*** POOL IS NOW DEAD ***
```
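To make the ordering suspected in the trace above concrete, here is a toy Python model of the refcount bookkeeping. It is only a sketch of the hypothesis, not kernel code: `ToySpaceMap`, `prepared_discard`, and the block numbers are made-up names and values, and the real dm-thin locking and btree handling are ignored. It shows only why a second decrement of an already-released block would come back as -EINVAL (-22).

```python
import errno


class ToySpaceMap:
    """Hypothetical stand-in for the data space map: block -> refcount."""

    def __init__(self):
        self.refcount = {}

    def inc(self, block):
        self.refcount[block] = self.refcount.get(block, 0) + 1

    def dec(self, block):
        # Mirrors the "case 0" branch quoted above: a block whose count
        # is already 0 cannot be decremented again.
        if self.refcount.get(block, 0) == 0:
            return -errno.EINVAL
        self.refcount[block] -= 1
        return 0


sm = ToySpaceMap()
thin_mapping = {0: 100, 1: 101, 2: 102}  # virtual block -> data block
for data_block in thin_mapping.values():
    sm.inc(data_block)

# Phase 1: a discard for virtual blocks 0 and 1 is prepared, not yet applied.
# ASSUMPTION: the prepared work remembers the data blocks it will release.
prepared_discard = [thin_mapping[v] for v in (0, 1)]

# Phase 2: the thin device is deleted; every mapped block is released once.
delete_results = [sm.dec(b) for b in thin_mapping.values()]
thin_mapping.clear()

# Phase 3: the stale discard runs and releases the same blocks again.
stale_results = [sm.dec(b) for b in prepared_discard]

print(delete_results)  # [0, 0, 0]
print(stale_results)   # [-22, -22]
```

Whether the kernel can actually reach phase 3 with stale work is exactly the question for the authors; the model only assumes that it can.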
As always, many thanks for your help.
# issue: unable to activate thin pool
```
[Wed Apr 8 17:05:14 2026] device-mapper: space map common: unable to decrement block
[Wed Apr 8 17:08:11 2026] device-mapper: space map common: unable to decrement block
[Wed Apr 8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
[Wed Apr 8 17:08:11 2026] device-mapper: space map common: unable to decrement block
[Wed Apr 8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
[Wed Apr 8 17:08:11 2026] device-mapper: space map common: unable to decrement block
[Wed Apr 8 17:08:11 2026] device-mapper: space map common: dm_tm_shadow_block() failed
[Wed Apr 8 17:08:31 2026] device-mapper: space map common: unable to decrement block
[Wed Apr 8 17:08:31 2026] device-mapper: space map common: dm_tm_shadow_block() failed
[Wed Apr 8 17:08:31 2026] device-mapper: space map common: unable to decrement block
[Wed Apr 8 17:08:31 2026] device-mapper: space map common: dm_tm_shadow_block() failed
```
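For anyone triaging a longer dmesg capture with this signature, a small Python sketch along these lines can count the messages and report when each first appeared. The helper names (`unwrap`, `summarize`) are hypothetical, the sample is an excerpt of the log above, and the regular expression assumes the `[timestamp] device-mapper: subsystem: message` shape seen here. It tolerates entries that a mail client has wrapped across two lines.

```python
import re
from collections import Counter

# Excerpt of the log above, wrapped the way a mail client may wrap it.
RAW_LOG = """\
[Wed Apr 8 17:05:14 2026] device-mapper: space map common: unable to
decrement block
[Wed Apr 8 17:08:11 2026] device-mapper: space map common: unable to
decrement block
[Wed Apr 8 17:08:11 2026] device-mapper: space map common:
dm_tm_shadow_block() failed
"""

ENTRY_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+device-mapper: (?P<subsys>[^:]+): (?P<msg>.+)$"
)


def unwrap(text):
    """Join continuation lines onto the '[timestamp]' line they belong to."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def summarize(text):
    """Return (count per message, first timestamp per message)."""
    counts = Counter()
    first_seen = {}
    for entry in unwrap(text):
        match = ENTRY_RE.match(entry)
        if match is None:
            continue  # not a device-mapper line in the expected shape
        key = (match["subsys"], match["msg"])
        counts[key] += 1
        first_seen.setdefault(key, match["ts"])
    return counts, first_seen


counts, first_seen = summarize(RAW_LOG)
for (subsys, msg), n in counts.items():
    print(f"{n}x {subsys}: {msg} (first at {first_seen[(subsys, msg)]})")
```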
# host and lvm tools version
```
uname -a
Linux kernel: 5.14.0-427.109.1.el9_4
RHEL 9.4
lvm version
2.03.23(2) (2023-11-21)
library: 1.02.197 (2023-11-21)
driver: 4.48.1
```
Below is the block-layer layout of the node.
The workload consisted of I/O (including discards) plus thin volume
creations and deletions.
```
[root@root core]# lvs -a pwx1
Please remove the lvm.conf global_filter, it is ignored with the devices file.
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
1004123733318649769 pwx1 Vwi-a-t--- 50.00g pxpool 660563940592999863 0.25
103699400925372609 pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
1072608604746349133 pwx1 Vwi-a-t--- 50.00g pxpool 941788757364603035 0.25
1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool 59.75
1138695541641144166 pwx1 Vwi-a-t--- 50.00g pxpool 941788757364603035 0.25
136169780918964477 pwx1 Vwi-aot--- 30.00g pxpool 33.33
218651423266852202 pwx1 Vwi-aot--- 5.00g pxpool 3.49
404947242154831849 pwx1 Vwi-aot--- 5.00g pxpool 4.20
440731835552948333 pwx1 Vwi-aot--- 50.00g pxpool 5.59
462681831690737818 pwx1 Vwi-a-t--- 50.00g pxpool 73089959772282964 0.25
519898065353250833 pwx1 Vwi-a-t--- 50.00g pxpool 660563940592999863 0.25
527922274169222783 pwx1 Vwi-aot--- 200.00g pxpool 28.64
537994915504805835 pwx1 Vwi-aot--- 50.00g pxpool 10.88
569690966828279529 pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
594992999737145586 pwx1 Vwi-aot--- 200.00g pxpool 28.91
660563940592999863 pwx1 Vwi-aot--- 50.00g pxpool 0.25
702358223003836192 pwx1 Vwi-aot--- 200.00g pxpool 28.64
73089959772282964 pwx1 Vwi-aot--- 50.00g pxpool 0.25
793515512579595979 pwx1 Vwi-aot--- 30.00g pxpool 33.33
79731196567060146 pwx1 Vwi-aot--- 50.00g pxpool 10.90
865397616123963982 pwx1 Vwi-aot--- 50.00g pxpool 9.39
866802183893693297 pwx1 Vwi-aot--- 200.00g pxpool 28.91
941788757364603035 pwx1 Vwi-aot--- 50.00g pxpool 0.25
960350716126095496 pwx1 Vwi-a-t--- 50.00g pxpool 73089959772282964 0.25
[lvol0_pmspare] pwx1 ewi------- 2.00g
pxMetaFS pwx1 Vwi-aot--- 64.00g pxpool 0.05
pxpool pwx1 twi-aot--- 1.54t 43.59 5.06 <<< very low tmeta util.
[pxpool_tdata] pwx1 Twi-ao---- 1.54t
[pxpool_tmeta] pwx1 ewi-ao---- 4.00g
pxreserve pwx1 -wi------k 15.00g
[root@root core]#
[root@root core]# vgs pwx1
Please remove the lvm.conf global_filter, it is ignored with the devices file.
VG #PV #LV #SN Attr VSize VFree
pwx1 1 27 0 wz--n- 1.56t 0
[root@root core]# lsblk -s /dev/pwx1/1004123733318649769
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
pwx1-1004123733318649769 253:107 0 50G 0 lvm
└─pwx1-pxpool-tpool 253:14 0 1.5T 0 lvm
├─pwx1-pxpool_tmeta 253:12 0 4G 0 lvm
│ └─md126 9:126 0 1.6T 0 raid0
│ └─eui.00806e28521374ac24a93718000982be 253:10 0 1.6T 0 mpath
│ ├─nvme4n2 259:5 0 1.6T 0 disk
│ ├─nvme5n2 259:8 0 1.6T 0 disk
│ └─nvme6n2 259:11 0 1.6T 0 disk
└─pwx1-pxpool_tdata 253:13 0 1.5T 0 lvm
└─md126 9:126 0 1.6T 0 raid0
└─eui.00806e28521374ac24a93718000982be 253:10 0 1.6T 0 mpath
├─nvme4n2 259:5 0 1.6T 0 disk
├─nvme5n2 259:8 0 1.6T 0 disk
└─nvme6n2 259:11 0 1.6T 0 disk
[root@root core]# ls -al /dev/md/pwx1
lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
[root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing
[root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
0 1008459776 linear 9:126 35653632
1008459776 629145600 linear 9:126 1048307712
1637605376 1673527296 linear 9:126 1681647616
[root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
0 4194304 linear 9:126 1044113408
4194304 4194304 linear 9:126 1677453312
[root@root core]#
[root@root core]# dmsetup table --target multipath
3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:2 1 1
eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:0 1 1
eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:3 1 1
eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:1 1 1
eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:4 1 259:7 1 259:10 1
eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:5 1 259:8 1 259:11 1
eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:6 1 259:9 1 259:12 1
[root@root core]# dmsetup status --target multipath
3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:2 A 0 0 1
eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:0 A 0 0 1
eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:3 A 0 0 1
eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:1 A 0 0 1
eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
[root@root core]#
[root@root core]# mdadm -D /dev/md/pwx1
/dev/md/pwx1:
Version : 1.2
Creation Time : Tue Mar 17 15:29:31 2026
Raid Level : raid0
Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
Raid Devices : 1
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Mon Mar 23 20:52:51 2026
State : clean
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Chunk Size : 1024K
Consistency Policy : none
Name : any:pwx1
UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
Events : 16
Number Major Minor RaidDevice State
0 253 10 0 active sync /dev/dm-10
[root@root core]#
```
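As a cross-check of the `dmsetup table` output above, the following Python sketch decodes the thin-pool line and confirms that the linear segments add up. It assumes the field order given in the kernel thin-provisioning documentation (metadata dev, data dev, data block size in 512-byte sectors, low water mark, feature-arg count, feature args); the function and dictionary key names are my own. The pasted pool line declares 2 feature args but lists only one, so the sketch records whatever is present rather than guessing the missing one.

```python
SECTOR_BYTES = 512  # device-mapper tables count 512-byte sectors

POOL_LINE = "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing"
TDATA_LINES = [
    "0 1008459776 linear 9:126 35653632",
    "1008459776 629145600 linear 9:126 1048307712",
    "1637605376 1673527296 linear 9:126 1681647616",
]
TMETA_LINES = [
    "0 4194304 linear 9:126 1044113408",
    "4194304 4194304 linear 9:126 1677453312",
]


def parse_thin_pool(line):
    """Decode one thin-pool table line into named fields."""
    fields = line.split()
    if len(fields) < 7 or fields[2] != "thin-pool":
        raise ValueError("not a thin-pool table line")
    return {
        "length_sectors": int(fields[1]),
        "metadata_dev": fields[3],
        "data_dev": fields[4],
        "data_block_sectors": int(fields[5]),
        "low_water_mark": int(fields[6]),
        "declared_feature_args": int(fields[7]) if len(fields) > 7 else 0,
        "feature_args": fields[8:],  # as pasted; may be fewer than declared
    }


def linear_sectors(lines):
    """Sum the lengths of a device's linear segments, in sectors."""
    total = 0
    for line in lines:
        _start, length, target = line.split()[:3]
        if target != "linear":
            raise ValueError("not a linear table line")
        total += int(length)
    return total


pool = parse_thin_pool(POOL_LINE)
block_kib = pool["data_block_sectors"] * SECTOR_BYTES // 1024
pool_tib = pool["length_sectors"] * SECTOR_BYTES / 2**40
meta_gib = linear_sectors(TMETA_LINES) * SECTOR_BYTES / 2**30

print(f"data block size: {block_kib} KiB")
print(f"pool size: {pool_tib:.2f} TiB")
print(f"metadata size: {meta_gib:.2f} GiB")
print("tdata matches pool:", linear_sectors(TDATA_LINES) == pool["length_sectors"])
```

The results agree with the `lvs` listing: a 1.54t pool with 64 KiB data blocks and 4.00g of metadata.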