Hi Lakshmi,

On Tue, Apr 14, 2026 at 8:03 PM Lakshmi Narasimhan Sundararajan
<[email protected]> wrote:
>
> Hi Ming,
>
> Unfortunately, I do not have the raw metadata; it was not collected during
> the customer recovery.
> We do have discard_passdown enabled on those setups where the issue occurred.

Do you mean there's a "discard_passdown" flag in the `dmsetup status` output?
The feature count in the thin-pool table line is set to 2, with
skip_block_zeroing as the first flag; the second flag appears to have
been truncated.
Could you provide the name of the second flag?

"0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing ..."
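
To make the mismatch concrete, here is a minimal sketch that splits the pasted line into the thin-pool table fields and compares the declared feature count with the flags actually present. The function name `parse_thin_pool_table` is only illustrative, and the field order assumes the usual thin-pool table layout (metadata dev, data dev, data block size, low water mark, feature count, feature args).

```python
def parse_thin_pool_table(line):
    """Split one thin-pool `dmsetup table` line into named fields."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[2] != "thin-pool":
        raise ValueError("not a thin-pool table line")
    # The feature count is optional; treat a missing count as zero features.
    declared = int(tokens[7]) if len(tokens) > 7 else 0
    flags = tokens[8:8 + declared]
    return {
        "start": int(tokens[0]),
        "length": int(tokens[1]),
        "metadata_dev": tokens[3],
        "data_dev": tokens[4],
        "data_block_size": int(tokens[5]),   # in 512-byte sectors
        "low_water_mark": int(tokens[6]),    # in data blocks
        "declared_features": declared,
        "flags": flags,
        "missing_flags": declared - len(flags),
    }


# The line exactly as it appeared in the report (second flag cut off).
pasted = "0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing"
info = parse_thin_pool_table(pasted)
print(info["declared_features"], info["flags"], info["missing_flags"])
```

A non-zero `missing_flags` only says that the paste is incomplete; it does not tell us which flag was dropped.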



> Have you seen this problem before?
>
> On Tue, Apr 14, 2026 at 4:58 PM Ming Hung Tsai <[email protected]> wrote:
> >
> > Hi,
> >
> > I have two questions posted inline.
> >
> > On Tue, Apr 14, 2026 at 1:59 AM Lakshmi Narasimhan Sundararajan
> > <[email protected]> wrote:
> > >
> > > Good day!
> > >
> > > > A gentle ping to hear an update from the authors here. More updates
> > > > inline.
> > >
> > > On Sat, Apr 11, 2026 at 5:41 PM Lakshmi Narasimhan Sundararajan
> > > <[email protected]> wrote:
> > > >
> > > > Hi LVM Team! A very good day to you all.
> > > > [ I hope this email is the right one now]
> > > >
> > > > I recently experienced an outage where thin pool activation failed,
> > > > details are as follows.
> > > > Good news is, I was able to recover the pool through thin_repair.
> > > > Thank goodness!
> > > >
> > > > > There was no infra-induced failure, i.e. no network, disk, memory or
> > > > > compute was faulty, over its usage limit, or overused in any way.
> > > > Node was running healthy for 13 days and suddenly hit this issue.
> > > > Pool would handle I/O load (including discards), new volume
> > > > creation/deletion, and other regular activities.
> > > >
> > > > > I tried to identify if there is a direct known issue, but I was
> > > > > unable to.
> > > > This generally seems to be some known issue, but I am unable to find a
> > > > direct link with the same signature.
> > > >
> > > > > a) How can I induce thin pool failures at will, such that the thin
> > > > > pool does not activate but repair succeeds, so I can test this
> > > > > recovery in some controlled form?
> > >
> > > I think I have found a way: I can pull out the metadata XML, modify the
> > > highest transaction, rewrite the metadata, and swap it into the pool to
> > > recreate this condition.
> > > Pool activation will fail and thin_repair can correct it. If there is an
> > > easier way than this, please feel free to suggest it.
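
For illustration only, a minimal sketch of the XML edit step described above, assuming "highest transaction" refers to the `transaction` attribute on the `<superblock>` element of a thin_dump XML dump. The helper name `bump_transaction` and the sample XML are hypothetical and are not taken from the affected pool. Metadata rebuilt from XML has a different on-disk layout than the original, so this may not exercise the same code path as the customer failure.

```python
import xml.etree.ElementTree as ET


def bump_transaction(xml_text, delta=1):
    """Return the XML with the superblock transaction id raised by `delta`."""
    root = ET.fromstring(xml_text)
    if root.tag != "superblock":
        raise ValueError("expected a <superblock> root element")
    current = int(root.get("transaction"))
    root.set("transaction", str(current + delta))
    return ET.tostring(root, encoding="unicode")


# Hypothetical, hand-written sample in the shape of a thin_dump XML dump.
sample = """<superblock uuid="" time="0" transaction="5" flags="0" version="2" data_block_size="128" nr_data_blocks="1000">
  <device dev_id="1" mapped_blocks="0" transaction="0" creation_time="0" snap_time="0">
  </device>
</superblock>"""

modified = bump_transaction(sample)
print(ET.fromstring(modified).get("transaction"))
```

Only the superblock attribute is changed; the per-device `transaction` attributes are left as they were.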
> >
> > Were you able to reproduce the issue on your end? I'm concerned that
> > using metadata rebuilt from XML might not trigger the bug because the
> > rebuilt layout differs from the original.
> >
> > Could you please provide the raw metadata image prior to any repairs?
> > This will allow us to investigate the issue further.
> >
> >
> > > > > b) To the best of your knowledge, is this a known issue that has been
> > > > > fixed in a later release?
> > > > > I searched both the kernel bugzilla and RHEL, and I am hoping you can
> > > > > help me find it. Internet searches point to errata pages, but I am
> > > > > unable to find the exact ticket or commit that addresses this. The OCP
> > > > > platform was running a recent release from RHEL.
> > > > > Linux kernel: 5.14.0-427.109.1.el9_4 (RHEL 9.4). This is likely 2 years
> > > > > old though.
> > >
> > > So far my team and I have not been able to reproduce this issue, and we
> > > look to your help in confirming:
> > > a) whether this is already known and fixed,
> > > b) what the safest kernel to upgrade to is,
> > > c) or whether this is still an open issue.
> > >
> > > >
> > > > > c) After spending some time reviewing the thin code and the commits
> > > > > since the mentioned kernel on kernel.org, I suspect this could be
> > > > > caused by a race between discard and either IO or device
> > > > > creation/deletion on the same pool.
> > > > > Could the authors here please confirm my code reading below?
> > > > ```
> > >
> > > As mentioned, we tried a focused reproducer around this, but were unable
> > > to trigger the issue.
> > > There are volume creation/deletions, snapshot creations and deletions,
> > > discards and regular IO at any point on the thin pool.
> > > And in addition, there would be calls to reserve/release the thin
> > > metadata to capture diff for backup between volumes.
> > > These are serialized at our app layer and I also see these are
> > > serialized within lvm layer too.
> > >
> > > Our volume deletions are two-phased. In our earlier discussion in this
> > > thread, we had observed high IO latency when volumes are deleted, and
> > > the suggestion from this team was to discard first and then delete
> > > volumes to keep the deletion time short.
> > >
> > > I hope this gives enough context to understand this better.
> > > Unfortunately, since I am unable to reproduce this and have now seen it
> > > twice at customer sites, I have no more datapoints to add.
> > > I would be glad to hear any suggestions I can pursue in house.
> > >
> > > Best regards
> > >
> > >
> > > > *** phase 1 - userspace issues blkdiscard on thin volumes ***
> > > >   dm-thin.c : thin_bio_map()
> > > >     → detects REQ_OP_DISCARD
> > > >     → thin_defer_bio_with_throttle(tc, bio)
> > > > >       → adds bio to tc->deferred_bio_list        // QUEUED, not processed
> > > >       → wakes pool worker thread
> > > >
> > > >   dm-thin.c : do_worker()                         // runs ASYNCHRONOUSLY
> > > >     → process_deferred_bios()
> > > >       → process_thin_deferred_bios()
> > > >         → process_discard_bio()
> > > >           → creates mapping, adds to pool->prepared_discards
> > > >     → process_prepared(pool->prepared_discards)
> > > >       → process_prepared_discard_no_passdown(m):
> > > >         → dm_thin_remove_range(tc->td, begin, end)
> > > >             [dm-thin-metadata.c]
> > > >           → dm_btree_remove_leaves()
> > > >               [dm-btree-remove.c]
> > > >             → data_block_dec()                    // for each data block
> > > >                 [dm-thin-metadata.c]
> > > >               → dm_sm_dec_blocks()                // DECREMENTS refcount
> > > >                   [dm-space-map-common.c]
> > > >
> > > > > *** phase 2: these steps may still be IN PROGRESS or QUEUED when
> > > > > userspace deletes the thin volume ***
> > > >
> > > >   dm-thin.c : thin_dtr()                          // dmsetup remove
> > > > >     → list_del_rcu(&tc->list)                     // removes from
> > > > >                                                   //   pool->active_thins
> > > >     → synchronize_rcu()
> > > >     → dm_pool_close_thin_device(tc->td)           // open_count--
> > > >     → kfree(tc)                                   // tc FREED
> > > >
> > > >     *** does NOT flush pool workqueue ***          ← GAP 1
> > > >     *** does NOT drain prepared_discards ***       ← GAP 2
> > > >
> > > >   dm-thin.c : process_delete_mesg()               // dmsetup message
> > > >     → dm_pool_delete_thin_device(pool->pmd, dev_id)
> > > >         [dm-thin-metadata.c : __delete_device()]
> > > > >       → dm_btree_remove(&pmd->tl_info, ...)       // remove from top-level
> > > > >           [dm-btree-remove.c]                      //   btree
> > > >         → subtree_dec()                            // cascades into:
> > > >             [dm-thin-metadata.c]
> > > >           → dm_btree_del()                         // walks ALL leaves
> > > >               [dm-btree.c]
> > > >             → data_block_dec() for EVERY remaining block
> > > >                 [dm-thin-metadata.c]
> > > > >               → dm_sm_dec_blocks()                 // DECREMENTS refcount
> > > > >                   [dm-space-map-common.c]          //   for ALL blocks
> > > >
> > > > > *** phase 3: KERNEL (worker thread, still running from Phase 1) ***
> > > > >   dm-thin.c : do_worker()                         // ASYNC, still running
> > > >     → process_prepared(pool->prepared_discards)
> > > >       → process_prepared_discard_no_passdown(m):
> > > > >         → m->tc points to FREED tc                // ← use-after-free risk
> > > >         → dm_thin_remove_range(tc->td, begin, end)
> > > >             [dm-thin-metadata.c]
> > > >           → dm_btree_remove_leaves()
> > > >               [dm-btree-remove.c]
> > > >             → data_block_dec()                    // SAME blocks already
> > > >                 [dm-thin-metadata.c]              //   decremented in
> > > >               → dm_sm_dec_blocks()                //   Phase 2!
> > > >                   [dm-space-map-common.c]
> > > >
> > > >                 ┌──────────────────────────────────────────────────┐
> > > >                   sm_ll_dec_bitmap():
> > > >                     old = sm_lookup_bitmap(ic->bitmap, bit);
> > > >                     switch (old) {
> > > >                     case 0:  // ← refcount ALREADY 0
> > > >                       DMERR("unable to decrement block");
> > > >                       return -EINVAL;  // -22
> > > >                     }
> > > >                                  [dm-space-map-common.c]
> > > >                 └──────────────────────────────────────────────────┘
> > > >
> > > >                           ▼
> > > >                 dm_tm_shadow_block() fails (corrupted space map)
> > > >                     [dm-transaction-manager.c]
> > > >
> > > >                           ▼
> > > >                 dm_pool_inc_data_range() fails with -EINVAL (-22)
> > > >                     [dm-thin-metadata.c]
> > > >
> > > >                           ▼
> > > > >                 metadata_operation_failed(pool, "dm_pool_inc_data_range")
> > > >                     [dm-thin.c]
> > > >
> > > >                           ▼
> > > >                 set_pool_mode(pool, PM_READ_ONLY)
> > > >                     [dm-thin.c]
> > > >
> > > >                 *** POOL IS NOW DEAD ***
> > > > ```
> > > >
> > > >
> > > >
> > > > As always, many thanks for your help.
> > > >
> > > >
> > > > # issue unable to activate thin pool
> > > > ```
> > > > [Wed Apr  8 17:05:14 2026] device-mapper: space map common: unable to
> > > > decrement block
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > > > decrement block
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > > > dm_tm_shadow_block() failed
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > > > decrement block
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > > > dm_tm_shadow_block() failed
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common: unable to
> > > > decrement block
> > > > [Wed Apr  8 17:08:11 2026] device-mapper: space map common:
> > > > dm_tm_shadow_block() failed
> > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> > > > decrement block
> > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> > > > dm_tm_shadow_block() failed
> > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common: unable to
> > > > decrement block
> > > > [Wed Apr  8 17:08:31 2026] device-mapper: space map common:
> > > > dm_tm_shadow_block() failed
> > > > ```
> > > >
> > > > # host and lvm tools version
> > > > ```
> > > > uname -a
> > > > Linux kernel: 5.14.0-427.109.1.el9_4
> > > > RHEL 9.4
> > > >
> > > > lvm version
> > > > 2.03.23(2) (2023-11-21)
> > > > library: 1.02.197 (2023-11-21)
> > > > driver: 4.48.1
> > > > ```
> > > >
> > > > Below are references to the node block layer.
> > > > > There was IO (including discards) as well as thin volume creations and
> > > > > deletions.
> > > > ```
> > > > [root@root core]# lvs -a pwx1
> > > > >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> > > > >   LV                  VG   Attr       LSize   Pool   Origin              Data%  Meta%  Move Log Cpy%Sync Convert
> > > > >   1004123733318649769 pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> > > > >   103699400925372609  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> > > > >   1072608604746349133 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> > > > >   1115712468847455249 pwx1 Vwi-aot--- 750.00g pxpool                     59.75
> > > > >   1138695541641144166 pwx1 Vwi-a-t---  50.00g pxpool 941788757364603035  0.25
> > > > >   136169780918964477  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> > > > >   218651423266852202  pwx1 Vwi-aot---   5.00g pxpool                     3.49
> > > > >   404947242154831849  pwx1 Vwi-aot---   5.00g pxpool                     4.20
> > > > >   440731835552948333  pwx1 Vwi-aot---  50.00g pxpool                     5.59
> > > > >   462681831690737818  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> > > > >   519898065353250833  pwx1 Vwi-a-t---  50.00g pxpool 660563940592999863  0.25
> > > > >   527922274169222783  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> > > > >   537994915504805835  pwx1 Vwi-aot---  50.00g pxpool                     10.88
> > > > >   569690966828279529  pwx1 Vwi-a-t--- 750.00g pxpool 1115712468847455249 59.75
> > > > >   594992999737145586  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> > > > >   660563940592999863  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > > >   702358223003836192  pwx1 Vwi-aot--- 200.00g pxpool                     28.64
> > > > >   73089959772282964   pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > > >   793515512579595979  pwx1 Vwi-aot---  30.00g pxpool                     33.33
> > > > >   79731196567060146   pwx1 Vwi-aot---  50.00g pxpool                     10.90
> > > > >   865397616123963982  pwx1 Vwi-aot---  50.00g pxpool                     9.39
> > > > >   866802183893693297  pwx1 Vwi-aot--- 200.00g pxpool                     28.91
> > > > >   941788757364603035  pwx1 Vwi-aot---  50.00g pxpool                     0.25
> > > > >   960350716126095496  pwx1 Vwi-a-t---  50.00g pxpool 73089959772282964   0.25
> > > > >   [lvol0_pmspare]     pwx1 ewi-------   2.00g
> > > > >   pxMetaFS            pwx1 Vwi-aot---  64.00g pxpool                     0.05
> > > > >   pxpool              pwx1 twi-aot---   1.54t                            43.59  5.06 <<< very low tmeta util.
> > > > >   [pxpool_tdata]      pwx1 Twi-ao----   1.54t
> > > > >   [pxpool_tmeta]      pwx1 ewi-ao----   4.00g
> > > > >   pxreserve           pwx1 -wi------k  15.00g
> > > > [root@root core]#
> > > > [root@root core]# vgs pwx1
> > > > >   Please remove the lvm.conf global_filter, it is ignored with the devices file.
> > > >   VG   #PV #LV #SN Attr   VSize VFree
> > > >   pwx1   1  27   0 wz--n- 1.56t    0
> > > > [root@root core]# lsblk -s /dev/pwx1/1004123733318649769
> > > > > NAME                                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
> > > > pwx1-1004123733318649769                     253:107  0   50G  0 lvm
> > > > └─pwx1-pxpool-tpool                          253:14   0  1.5T  0 lvm
> > > >   ├─pwx1-pxpool_tmeta                        253:12   0    4G  0 lvm
> > > >   │ └─md126                                    9:126  0  1.6T  0 raid0
> > > >   │   └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> > > >   │     ├─nvme4n2                            259:5    0  1.6T  0 disk
> > > >   │     ├─nvme5n2                            259:8    0  1.6T  0 disk
> > > >   │     └─nvme6n2                            259:11   0  1.6T  0 disk
> > > >   └─pwx1-pxpool_tdata                        253:13   0  1.5T  0 lvm
> > > >     └─md126                                    9:126  0  1.6T  0 raid0
> > > >       └─eui.00806e28521374ac24a93718000982be 253:10   0  1.6T  0 mpath
> > > >         ├─nvme4n2                            259:5    0  1.6T  0 disk
> > > >         ├─nvme5n2                            259:8    0  1.6T  0 disk
> > > >         └─nvme6n2                            259:11   0  1.6T  0 disk
> > > > [root@root core]# ls -al /dev/md/pwx1
> > > > lrwxrwxrwx. 1 root root 8 Apr 11 11:48 /dev/md/pwx1 -> ../md126
> > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool-tpool
> > > > 0 3311132672 thin-pool 253:12 253:13 128 0 2 skip_block_zeroing
> >
> > The second feature flag was truncated. Does the thin-pool have
> > no_discard_passdown enabled?
> >
> >
> > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tdata
> > > > 0 1008459776 linear 9:126 35653632
> > > > 1008459776 629145600 linear 9:126 1048307712
> > > > 1637605376 1673527296 linear 9:126 1681647616
> > > > [root@root core]# dmsetup table /dev/mapper/pwx1-pxpool_tmeta
> > > > 0 4194304 linear 9:126 1044113408
> > > > 4194304 4194304 linear 9:126 1677453312
> > > > [root@root core]#
> > > > [root@root core]# dmsetup table --target multipath
> > > > > 3500a07513c1e23c4: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1
> > > > > 3500a07513c1e2ade: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:64 1 1
> > > > > 3500a07513c1e2ca8: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:80 1 1
> > > > > 3500a07513c1e2cf3: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
> > > > > 3500a07513c1e3afc: 0 3750748848 multipath 0 0 1 1 service-time 0 1 2 8:48 1 1
> > > > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:2 1 1
> > > > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:0 1 1
> > > > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:3 1 1
> > > > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 0 0 1 1 service-time 0 1 2 259:1 1 1
> > > > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:4 1 259:7 1 259:10 1
> > > > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:5 1 259:8 1 259:11 1
> > > > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 3 retain_attached_hw_handler queue_mode bio 0 1 1 queue-length 0 3 1 259:6 1 259:9 1 259:12 1
> > > > [root@root core]# dmsetup status --target multipath
> > > > > 3500a07513c1e23c4: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:16 A 0 0 1
> > > > > 3500a07513c1e2ade: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:64 A 0 0 1
> > > > > 3500a07513c1e2ca8: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:80 A 0 0 1
> > > > > 3500a07513c1e2cf3: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:0 A 0 0 1
> > > > > 3500a07513c1e3afc: 0 3750748848 multipath 2 0 0 0 1 1 A 0 1 2 8:48 A 0 0 1
> > > > > eui.000000000000000100a075223f0c4773: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:2 A 0 0 1
> > > > > eui.000000000000000100a075223f0c47a6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:0 A 0 0 1
> > > > > eui.000000000000000100a075233fc94da6: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:3 A 0 0 1
> > > > > eui.000000000000000100a075233fc94de4: 0 3125627568 multipath 2 0 0 0 1 1 A 0 1 2 259:1 A 0 0 1
> > > > > eui.00806e28521374ac24a93718000982bd: 0 14680064000 multipath 2 0 0 0 1 1 A 0 3 1 259:4 A 1 22 259:7 A 1 21 259:10 A 1 22
> > > > > eui.00806e28521374ac24a93718000982be: 0 3355443200 multipath 2 0 0 0 1 1 A 0 3 1 259:5 A 1 4 259:8 A 1 6 259:11 A 1 11
> > > > > eui.00806e28521374ac24a93718000982bf: 0 134217728 multipath 2 0 0 0 1 1 A 0 3 1 259:6 A 1 0 259:9 A 1 0 259:12 A 1 0
> > > > [root@root core]#
> > > > [root@root core]# mdadm -D /dev/md/pwx1
> > > > /dev/md/pwx1:
> > > >            Version : 1.2
> > > >      Creation Time : Tue Mar 17 15:29:31 2026
> > > >         Raid Level : raid0
> > > >         Array Size : 1677589504 (1599.87 GiB 1717.85 GB)
> > > >       Raid Devices : 1
> > > >      Total Devices : 1
> > > >        Persistence : Superblock is persistent
> > > >
> > > >        Update Time : Mon Mar 23 20:52:51 2026
> > > >              State : clean
> > > >     Active Devices : 1
> > > >    Working Devices : 1
> > > >     Failed Devices : 0
> > > >      Spare Devices : 0
> > > >
> > > >         Chunk Size : 1024K
> > > >
> > > > Consistency Policy : none
> > > >
> > > >               Name : any:pwx1
> > > >               UUID : 1716a351:ed3e53e7:0ce83ccd:8d3a3021
> > > >             Events : 16
> > > >
> > > >     Number   Major   Minor   RaidDevice State
> > > >        0     253       10        0      active sync   /dev/dm-10
> > > > [root@root core]#
> > > > ```
> > >
> >
> >
>

