Re: Reg dm thin pool metadata inconsistency

Ming Hung Tsai Tue, 14 Apr 2026 04:29:01 -0700

Hi,

I have two questions posted inline.


On Tue, Apr 14, 2026 at 1:59 AM Lakshmi Narasimhan Sundararajan
<[email protected]> wrote:
>
> Good day!
>
> A gentle ping to hear an update from the authors here. More updates inline.
>
> On Sat, Apr 11, 2026 at 5:41 PM Lakshmi Narasimhan Sundararajan
> <[email protected]> wrote:
> >
> > Hi LVM Team! A very good day to you all.
> > [ I hope this email is the right one now]
> >
> > I recently experienced an outage where thin pool activation failed,
> > details are as follows.
> > Good news is, I was able to recover the pool through thin_repair.
> > Thank goodness!
> >
> > There was no infra induced failure i.e. no network, disk, usage over
> > limit, memory or compute being faulty or overused in any way.
> > Node was running healthy for 13 days and suddenly hit this issue.
> > Pool would handle I/O load (including discards), new volume
> > creation/deletion, and other regular activities.
> >
> > I tried to identify if there is a direct known issue, but I was unable to.
> > This generally seems to be some known issue, but I am unable to find a
> > direct link with the same signature.
> >
> > a) How can I induce thin pool failures at will, such that the thin pool
> > does not activate but repair succeeds, so I can test this recovery in
> > some controlled form?
>
> I have found a way, I think I can pull out the metadata xml and modify
> highest transaction and rewrite the metadata and swap the pool to
> recreate this condition.
> Pool activation will fail and thin_repair can correct it. If there is an
> easier way than my suggestion, please feel free to suggest it.

Were you able to reproduce the issue on your end? I'm concerned that
using metadata rebuilt from XML might not trigger the bug because the
rebuilt layout differs from the original.

Could you please provide the raw metadata image prior to any repairs?
This will allow us to investigate the issue further.
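
For reference, the XML edit described in the quoted text could look roughly
like the sketch below. It is only an illustration: the sample XML is a
hypothetical, heavily trimmed stand-in for a thin_dump-style dump, and the
assumption that the pool-wide transaction id is the `transaction` attribute
of the `superblock` element should be checked against a real dump. As noted
above, metadata rebuilt from XML may not have the same on-disk layout as the
original.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed thin_dump-style XML. Real dumps carry more
# attributes and many more mappings.
SAMPLE_XML = """<superblock uuid="" time="0" transaction="41" data_block_size="128" nr_data_blocks="1000">
  <device dev_id="1" mapped_blocks="2" transaction="0" creation_time="0" snap_time="0">
    <range_mapping origin_begin="0" data_begin="0" length="2" time="0"/>
  </device>
</superblock>"""


def bump_transaction(xml_text: str, delta: int = 1) -> str:
    """Return the XML with the superblock transaction id raised by delta.

    Only the root element is touched; per-device transaction attributes
    are left as they are.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "superblock":
        raise ValueError("expected a <superblock> root element")
    old_id = int(root.attrib["transaction"])
    root.set("transaction", str(old_id + delta))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(bump_transaction(SAMPLE_XML))
```

The edited XML would still have to be written back with the usual tooling
and swapped into the pool, which this sketch does not attempt.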


> > b) To the best of your knowledge, is this a known issue that has been
> > fixed in a later release?
> > I did my search at both kernel bugzilla and RHEL - and I am hoping you
> > can help me find it. Internet searches point to errata pages, but I am
> > unable to find the exact ticket or commit that addresses this. The OCP
> > platform was running a recent release from RHEL.
> > Linux kernel: 5.14.0-427.109.1.el9_4, RHEL 9.4. This is likely 2 years
> > old though.
>
> So far my team and I have not been able to reproduce this issue, and we
> look to your help in confirming:
> a) Is this already known and fixed?
> b) What is the safest kernel to upgrade to?
> c) Or is this still an open issue?
>
> >
> > c) After spending some time reviewing the thin code and the commits
> > since the mentioned kernel on kernel.org, I suspect that a race between
> > discard and either IO or device creation/deletion on the same pool
> > could cause this.
> > Could the authors here please confirm my code reading below?
> > ```
>
> As mentioned, we tried a focussed reproducer around this, but were unable
> to trigger the issue.
> There are volume creation/deletions, snapshot creations and deletions,
> discards and regular IO at any point on the thin pool.
> And in addition, there would be calls to reserve/release the thin
> metadata to capture diff for backup between volumes.
> These are serialized at our app layer and I also see these are
> serialized within lvm layer too.
>
> Our volume deletions are two-phased. In our earlier discussion in this
> thread we had observed high IO latency when volumes are deleted, and the
> suggestion from this team was to discard and then delete volumes to keep
> the deletion time short.
>
> I hope this is giving enough context to understand this better.
> Unfortunately, since I am unable to reproduce this and have now seen it
> twice on customer systems, I have no more data points to add.
> I would be glad to hear any suggestions I can pursue in-house.
>
> Best regards
>
>
> > *** phase 1 - userspace issues blkdiscard on thin volumes ***
> >   dm-thin.c : thin_bio_map()
> >     → detects REQ_OP_DISCARD
> >     → thin_defer_bio_with_throttle(tc, bio)
> >       → adds bio to tc->deferred_bio_list        // QUEUED, not processed
> >       → wakes pool worker thread
> >
> >   dm-thin.c : do_worker()                         // runs ASYNCHRONOUSLY
> >     → process_deferred_bios()
> >       → process_thin_deferred_bios()
> >         → process_discard_bio()
> >           → creates mapping, adds to pool->prepared_discards
> >     → process_prepared(pool->prepared_discards)
> >       → process_prepared_discard_no_passdown(m):
> >         → dm_thin_remove_range(tc->td, begin, end)
> >             [dm-thin-metadata.c]
> >           → dm_btree_remove_leaves()
> >               [dm-btree-remove.c]
> >             → data_block_dec()                    // for each data block
> >                 [dm-thin-metadata.c]
> >               → dm_sm_dec_blocks()                // DECREMENTS refcount
> >                   [dm-space-map-common.c]
> >
> > *** phase 2 - the steps above may still be IN PROGRESS or QUEUED when
> > userspace deletes the thin volume ***
> >
> >   dm-thin.c : thin_dtr()                          // dmsetup remove
> >     → list_del_rcu(&tc->list)                     // removes from
> >                                                   //   pool->active_thins
> >     → synchronize_rcu()
> >     → dm_pool_close_thin_device(tc->td)           // open_count--
> >     → kfree(tc)                                   // tc FREED
> >
> >     *** does NOT flush pool workqueue ***          ← GAP 1
> >     *** does NOT drain prepared_discards ***       ← GAP 2
> >
> >   dm-thin.c : process_delete_mesg()               // dmsetup message
> >     → dm_pool_delete_thin_device(pool->pmd, dev_id)
> >         [dm-thin-metadata.c : __delete_device()]
> >       → dm_btree_remove(&pmd->tl_info, ...)       // remove from top-level
> >           [dm-btree-remove.c]                      //   btree
> >         → subtree_dec()                            // cascades into:
> >             [dm-thin-metadata.c]
> >           → dm_btree_del()                         // walks ALL leaves
> >               [dm-btree.c]
> >             → data_block_dec() for EVERY remaining block
> >                 [dm-thin-metadata.c]
> >               → dm_sm_dec_blocks()                 // DECREMENTS refcount
> >                   [dm-space-map-common.c]          //   for ALL blocks
> >
> > *** phase 3 - KERNEL (worker thread, still running from phase 1) ***
> >   dm-thin.c : do_worker()                         // ASYNC, still running
> >     → process_prepared(pool->prepared_discards)
> >       → process_prepared_discard_no_passdown(m):
> >         → m->tc points to FREED tc                // ← use-after-free risk
> >         → dm_thin_remove_range(tc->td, begin, end)
> >             [dm-thin-metadata.c]
> >           → dm_btree_remove_leaves()
> >               [dm-btree-remove.c]
> >             → data_block_dec()                    // SAME blocks already
> >                 [dm-thin-metadata.c]              //   decremented in
> >               → dm_sm_dec_blocks()                //   Phase 2!
> >                   [dm-space-map-common.c]
> >
> >                 ┌──────────────────────────────────────────────────┐
> >                   sm_ll_dec_bitmap():
> >                     old = sm_lookup_bitmap(ic->bitmap, bit);
> >                     switch (old) {
> >                     case 0:  // ← refcount ALREADY 0
> >                       DMERR("unable to decrement block");
> >                       return -EINVAL;  // -22
> >                     }
> >                                  [dm-space-map-common.c]
> >                 └──────────────────────────────────────────────────┘
> >
> >                           ▼
> >                 dm_tm_shadow_block() fails (corrupted space map)
> >                     [dm-transaction-manager.c]
> >
> >                           ▼
> >                 dm_pool_inc_data_range() fails with -EINVAL (-22)
> >                     [dm-thin-metadata.c]
> >
> >                           ▼
> >                 metadata_operation_failed(pool, "dm_pool_inc_data_range")
> >                     [dm-thin.c]
> >
> >                           ▼
> >                 set_pool_mode(pool, PM_READ_ONLY)
> >                     [dm-thin.c]
> >
> >                 *** POOL IS NOW DEAD ***
> > ```
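
The ordering hypothesised in the trace above can be modelled in a few lines
of userspace code. This is a toy sketch only: `SpaceMap`, `delete_device`
and `replay_stale_discard` are hypothetical names, not kernel functions, and
the model says nothing about whether the kernel actually permits this
ordering. It only shows why a second decrement of an already-released block
would surface as "unable to decrement block".

```python
class SpaceMap:
    """Toy stand-in for a data space map: block number -> reference count."""

    def __init__(self, refcounts):
        self.refcounts = dict(refcounts)

    def dec(self, block):
        # Mirrors the quoted check: a block whose count is already 0
        # cannot be decremented again.
        if self.refcounts.get(block, 0) == 0:
            raise ValueError(f"unable to decrement block {block}")
        self.refcounts[block] -= 1


def delete_device(space_map, mappings):
    """Phase 2 in the trace: drop one reference for every mapped block."""
    for block in mappings:
        space_map.dec(block)
    mappings.clear()


def replay_stale_discard(space_map, blocks):
    """Phase 3 in the trace: a discard prepared earlier runs afterwards."""
    for block in blocks:
        space_map.dec(block)


def run_scenario():
    space_map = SpaceMap({10: 1, 11: 1})
    mappings = [10, 11]
    stale = list(mappings)              # range captured by a queued discard
    delete_device(space_map, mappings)  # refcounts drop to 0
    try:
        replay_stale_discard(space_map, stale)
    except ValueError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(run_scenario())
```

If the discard is replayed before the delete instead, the mappings are
already gone when the device is deleted and each block is decremented only
once, so the model raises no error.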
> >
> >
> >
> > As always, many thanks for your help.
> >
> >
> > # issue unable to activate thin pool
> > ```
> > [Wed Apr  8 17:05:14 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> > decrement block
> > [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> > dm_tm_shadow_block() failed
> > ```
> >
> > # host and lvm tools version
> > ```
> > uname -a
> > Linux kernel: 5.14.0-427.109.1.el9_4
> > RHEL 9.4
> >
> > lvm version
> > 2.03.23(2) (2023-11-21)
> > library: 1.02.197 (2023-11-21)
> > driver: 4.48.1
> > ```
> >
> > Below are references to the node block layer.
> > There was IO, thin volume creations and deletions, IO includes discards too.
> > ```
> > [root@root core]# lvs -a pwx1
> >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> >   LV                  VG   Attr       LSize   Pool   Origin              Data%  Meta%  Move Log Cpy%Sync Convert
> >   1004123733318649769 pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> >   103699400925372609  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> >   1072608604746349133 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> >   1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool                     59.75
> >   1138695541641144166 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> >   136169780918964477  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> >   218651423266852202  pwx1 Vwi-aot---   5.00g pxpool                     3.49
> >   404947242154831849  pwx1 Vwi-aot---   5.00g pxpool                     4.20
> >   440731835552948333  pwx1 Vwi-aot---  50.00g pxpool                     5.59
> >   462681831690737818  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> >   519898065353250833  pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> >   527922274169222783  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> >   537994915504805835  pwx1 Vwi-aot---  50.00g pxpool                     10.88
> >   569690966828279529  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> >   594992999737145586  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> >   660563940592999863  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> >   702358223003836192  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> >   73089959772282964   pwx1 Vwi-aot---  50.00g pxpool                     0.25
> >   793515512579595979  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> >   79731196567060146   pwx1 Vwi-aot---  50.00g pxpool                     10.90
> >   865397616123963982  pwx1 Vwi-aot---  50.00g pxpool                     9.39
> >   866802183893693297  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> >   941788757364603035  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> >   960350716126095496  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> >   [lvol0_pmspare]     pwx1 ewi-------   2.00g
> >   pxMetaFS            pwx1 Vwi-aot---  64.00g pxpool                     0.05
> >   pxpool              pwx1 twi-aot---   1.54t                            43.59  5.06 <<< very low tmeta util.
> >   [pxpool_tdata]      pwx1 Twi-ao----   1.54t
> >   [pxpool_tmeta]      pwx1 ewi-ao----   4.00g
> >   pxreserve           pwx1 -wi------k  15.00g
> > [root@root core]#
> > [root@root core]# vgs pwx1
> >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> >   VG   #PV #LV #SN Attr   VSize VFree
> >   pwx1   1  27   0 wz--n- 1.56t    0
> > [root@root core]# lsblk -s /dev/pwx1/1004123733318649769
> > NAME                                         MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
> > pwx1-1004123733318649769                     253:107  0   50G  0 lvm
> > └─pwx1-pxpool-tpool                          253:14   0  1.5T  0 lvm
> >   ├─pwx1-pxpool_tmeta                        253:12   0    4G  0 lvm
> >   │ └─md126                                    9:126  0  1.6T  0 raid0
> >   │   └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> >   │     ├─nvme4n2                            259:5    0  1.6T  0 disk
> >   │     ├─nvme5n2                            259:8    0  1.6T  0 disk
> >   │     └─nvme6n2                            259:11   0  1.6T  0 disk
> >   └─pwx1-pxpool_tdata                        253:13   0  1.5T  0 lvm
> >     └─md126                                    9:126  0  1.6T  0 raid0
> >       └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> >         ├─nvme4n2                            259:5    0  1.6T  0 disk
> >         ├─nvme5n2                            259:8    0  1.6T  0 disk
> >         └─nvme6n2                            259:11   0  1.6T  0 disk
> > [root@root core]# ls -al /dev/md/pwx1
> > lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
> > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
> > 0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing

The second feature flag was truncated. Does the thin-pool have
no_discard_passdown enabled?
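
To make the truncation concrete, here is a small sketch that splits the
quoted thin-pool table line into its fields and compares the declared number
of feature arguments with the ones actually present. The field order follows
the thin-pool target's table format (metadata dev, data dev, data block size,
low water mark, feature count, features); the function name and dictionary
keys are made up for this example.

```python
def parse_thin_pool_table(line):
    """Split one `dmsetup table` line for a thin-pool target into fields."""
    fields = line.split()
    if len(fields) < 7 or fields[2] != "thin-pool":
        raise ValueError("not a thin-pool table line")
    # The feature count and the feature list are optional trailing fields.
    declared = int(fields[7]) if len(fields) > 7 else 0
    features = fields[8:]
    return {
        "start": int(fields[0]),
        "length": int(fields[1]),
        "metadata_dev": fields[3],
        "data_dev": fields[4],
        "data_block_size": int(fields[5]),   # in 512-byte sectors
        "low_water_mark": int(fields[6]),    # in blocks
        "declared_features": declared,
        "features": features,
        # True when fewer feature args are present than were declared.
        "truncated": len(features) < declared,
    }


if __name__ == "__main__":
    quoted = "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing"
    info = parse_thin_pool_table(quoted)
    print(info["declared_features"], info["features"], info["truncated"])
```

For the line quoted above, two feature arguments are declared but only
skip_block_zeroing is present, so the missing one cannot be recovered from
the paste and needs to be read from the live system.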


> > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
> > 0 1008459776 linear 9:126 35653632
> > 1008459776 629145600 linear 9:126 1048307712
> > 1637605376 1673527296 linear 9:126 1681647616
> > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
> > 0 4194304 linear 9:126 1044113408
> > 4194304 4194304 linear 9:126 1677453312
> > [root@root core]#
> > [root@root core]# dmsetup table --target multipath
> > 3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
> > 3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
> > 3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
> > 3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
> > 3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
> > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:2 1 1
> > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:0 1 1
> > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:3 1 1
> > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:1 1 1
> > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:4 1 259:7 1 259:10 1
> > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:5 1 259:8 1 259:11 1
> > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:6 1 259:9 1 259:12 1
> > [root@root core]# dmsetup status --target multipath
> > 3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
> > 3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
> > 3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
> > 3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
> > 3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
> > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:2 A 0 0 1
> > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:0 A 0 0 1
> > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:3 A 0 0 1
> > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:1 A 0 0 1
> > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
> > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
> > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
> > [root@root core]#
> > [root@root core]# mdadm -D /dev/md/pwx1
> > /dev/md/pwx1:
> >            Version : 1.2
> >      Creation Time : Tue Mar 17 15:29:31 2026
> >         Raid Level : raid0
> >         Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
> >       Raid Devices : 1
> >      Total Devices : 1
> >        Persistence : Superblock is persistent
> >
> >        Update Time : Mon Mar 23 20:52:51 2026
> >              State : clean
> >     Active Devices : 1
> >    Working Devices : 1
> >     Failed Devices : 0
> >      Spare Devices : 0
> >
> >         Chunk Size : 1024K
> >
> > Consistency Policy : none
> >
> >               Name : any:pwx1
> >               UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
> >             Events : 16
> >
> >     Number   Major   Minor   RaidDevice State
> >        0     253       10        0      active sync   /dev/dm-10
> > [root@root core]#
> > ```
>
