The commit is pushed to "branch-rh9-5.14.0-427.44.1.vz9.80.x-ovz" and will 
appear at g...@bitbucket.org:openvz/vzkernel.git
after rh9-5.14.0-427.44.1.vz9.80.38
------>
commit 8cb7bdaed670b0e3333c47d9247a512517e16940
Author: Ming Lei <ming....@redhat.com>
Date:   Thu Aug 15 10:47:36 2024 +0800

    ms/block: Fix lockdep warning in blk_mq_mark_tag_wait
    
    JIRA: https://issues.redhat.com/browse/RHEL-56837
    
     ms commit b313a8c835516bdda85025500be866ac8a74e022
     Author: Li Lingfeng <lilingfe...@huawei.com>
     Date:   Thu Aug 15 10:47:36 2024 +0800
    
        block: Fix lockdep warning in blk_mq_mark_tag_wait
    
        Lockdep reported a warning in Linux version 6.6:
    
        [  414.344659] ================================
        [  414.345155] WARNING: inconsistent lock state
        [  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
        [  414.346221] --------------------------------
        [  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        [  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
        [  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
        [  414.351204] {IN-SOFTIRQ-W} state was registered at:
        [  414.351751]   lock_acquire+0x18d/0x460
        [  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
        [  414.352769]   __wake_up_common_lock+0x22/0x60
        [  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
        [  414.353829]   sbitmap_queue_clear+0xdd/0x270
        [  414.354338]   blk_mq_put_tag+0xdf/0x170
        [  414.354807]   __blk_mq_free_request+0x381/0x4d0
        [  414.355335]   blk_mq_free_request+0x28b/0x3e0
        [  414.355847]   __blk_mq_end_request+0x242/0xc30
        [  414.356367]   scsi_end_request+0x2c1/0x830
        [  414.356863]   scsi_io_completion+0x177/0x1610
        [  414.357379]   scsi_complete+0x12f/0x260
        [  414.357856]   blk_complete_reqs+0xba/0xf0
        [  414.358338]   __do_softirq+0x1b0/0x7a2
        [  414.358796]   irq_exit_rcu+0x14b/0x1a0
        [  414.359262]   sysvec_call_function_single+0xaf/0xc0
        [  414.359828]   asm_sysvec_call_function_single+0x1a/0x20
        [  414.360426]   default_idle+0x1e/0x30
        [  414.360873]   default_idle_call+0x9b/0x1f0
        [  414.361390]   do_idle+0x2d2/0x3e0
        [  414.361819]   cpu_startup_entry+0x55/0x60
        [  414.362314]   start_secondary+0x235/0x2b0
        [  414.362809]   secondary_startup_64_no_verify+0x18f/0x19b
        [  414.363413] irq event stamp: 428794
        [  414.363825] hardirqs last  enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
        [  414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
        [  414.365629] softirqs last  enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
        [  414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
        [  414.367425]
                       other info that might help us debug this:
        [  414.368194]  Possible unsafe locking scenario:
        [  414.368900]        CPU0
        [  414.369225]        ----
        [  414.369548]   lock(&sbq->ws[i].wait);
        [  414.370000]   <Interrupt>
        [  414.370342]     lock(&sbq->ws[i].wait);
        [  414.370802]
                        *** DEADLOCK ***
        [  414.371569] 5 locks held by kworker/u10:3/1152:
        [  414.372088]  #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
        [  414.373180]  #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
        [  414.374384]  #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
        [  414.375342]  #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
        [  414.376377]  #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
        [  414.378607]
                       stack backtrace:
        [  414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
        [  414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [  414.381177] Workqueue: writeback wb_workfn (flush-253:0)
        [  414.381805] Call Trace:
        [  414.382136]  <TASK>
        [  414.382429]  dump_stack_lvl+0x91/0xf0
        [  414.382884]  mark_lock_irq+0xb3b/0x1260
        [  414.383367]  ? __pfx_mark_lock_irq+0x10/0x10
        [  414.383889]  ? stack_trace_save+0x8e/0xc0
        [  414.384373]  ? __pfx_stack_trace_save+0x10/0x10
        [  414.384903]  ? graph_lock+0xcf/0x410
        [  414.385350]  ? save_trace+0x3d/0xc70
        [  414.385808]  mark_lock.part.20+0x56d/0xa90
        [  414.386317]  mark_held_locks+0xb0/0x110
        [  414.386791]  ? __pfx_do_raw_spin_lock+0x10/0x10
        [  414.387320]  lockdep_hardirqs_on_prepare+0x297/0x3f0
        [  414.387901]  ? _raw_spin_unlock_irq+0x28/0x50
        [  414.388422]  trace_hardirqs_on+0x58/0x100
        [  414.388917]  _raw_spin_unlock_irq+0x28/0x50
        [  414.389422]  __blk_mq_tag_busy+0x1d6/0x2a0
        [  414.389920]  __blk_mq_get_driver_tag+0x761/0x9f0
        [  414.390899]  blk_mq_dispatch_rq_list+0x1780/0x1ee0
        [  414.391473]  ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
        [  414.392070]  ? sbitmap_get+0x2b8/0x450
        [  414.392533]  ? __blk_mq_get_driver_tag+0x210/0x9f0
        [  414.393095]  __blk_mq_sched_dispatch_requests+0xd99/0x1690
        [  414.393730]  ? elv_attempt_insert_merge+0x1b1/0x420
        [  414.394302]  ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
        [  414.394970]  ? lock_acquire+0x18d/0x460
        [  414.395456]  ? blk_mq_run_hw_queue+0x637/0xa00
        [  414.395986]  ? __pfx_lock_acquire+0x10/0x10
        [  414.396499]  blk_mq_sched_dispatch_requests+0x109/0x190
        [  414.397100]  blk_mq_run_hw_queue+0x66e/0xa00
        [  414.397616]  blk_mq_flush_plug_list.part.17+0x614/0x2030
        [  414.398244]  ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
        [  414.398897]  ? writeback_sb_inodes+0x241/0xcc0
        [  414.399429]  blk_mq_flush_plug_list+0x65/0x80
        [  414.399957]  __blk_flush_plug+0x2f1/0x530
        [  414.400458]  ? __pfx___blk_flush_plug+0x10/0x10
        [  414.400999]  blk_finish_plug+0x59/0xa0
        [  414.401467]  wb_writeback+0x7cc/0x920
        [  414.401935]  ? __pfx_wb_writeback+0x10/0x10
        [  414.402442]  ? mark_held_locks+0xb0/0x110
        [  414.402931]  ? __pfx_do_raw_spin_lock+0x10/0x10
        [  414.403462]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
        [  414.404062]  wb_workfn+0x2b3/0xcf0
        [  414.404500]  ? __pfx_wb_workfn+0x10/0x10
        [  414.404989]  process_scheduled_works+0x432/0x13f0
        [  414.405546]  ? __pfx_process_scheduled_works+0x10/0x10
        [  414.406139]  ? do_raw_spin_lock+0x101/0x2a0
        [  414.406641]  ? assign_work+0x19b/0x240
        [  414.407106]  ? lock_is_held_type+0x9d/0x110
        [  414.407604]  worker_thread+0x6f2/0x1160
        [  414.408075]  ? __kthread_parkme+0x62/0x210
        [  414.408572]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
        [  414.409168]  ? __kthread_parkme+0x13c/0x210
        [  414.409678]  ? __pfx_worker_thread+0x10/0x10
        [  414.410191]  kthread+0x33c/0x440
        [  414.410602]  ? __pfx_kthread+0x10/0x10
        [  414.411068]  ret_from_fork+0x4d/0x80
        [  414.411526]  ? __pfx_kthread+0x10/0x10
        [  414.411993]  ret_from_fork_asm+0x1b/0x30
        [  414.412489]  </TASK>
    
        When interrupts get turned on while a lock taken with spin_lock_irq()
        is still held, lockdep throws a warning because of a potential
        deadlock.
    
        blk_mq_prep_dispatch_rq
         blk_mq_get_driver_tag
          __blk_mq_get_driver_tag
           __blk_mq_alloc_driver_tag
            blk_mq_tag_busy -> tag is already busy
            // failed to get driver tag
         blk_mq_mark_tag_wait
          spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
          __add_wait_queue(wq, wait) -> wait queue active
          blk_mq_get_driver_tag
          __blk_mq_tag_busy
        -> 1) the tag must be idle, which means there can't be inflight IO
           spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
           spin_unlock_irq(&tags->lock) -> unlock B, turns interrupts on accidentally
        -> 2) the context must be preempted by an IO interrupt to trigger the deadlock.
    
        As shown above, the deadlock is not possible in theory, but the warning
        still needs to be fixed.
    
        Fix it by using spin_lock_irqsave() to take lock B instead of spin_lock_irq().
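
        For reference, the relevant primitive semantics (a minimal sketch
        for illustration, not the patched code itself):

            spin_lock_irq(&lock);                  /* disable IRQs on this CPU */
            spin_unlock_irq(&lock);                /* re-enable IRQs unconditionally */

            spin_lock_irqsave(&lock, flags);       /* disable IRQs, save prior state */
            spin_unlock_irqrestore(&lock, flags);  /* restore saved state: IRQs stay
                                                      off if they were off before */

        So when lock B nests inside an outer spin_lock_irq() section, only
        the irqsave/irqrestore pair preserves the outer section's disabled
        interrupt state.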
    
        Fixes: 4f1731df60f9 ("blk-mq: fix potential io hang by wrong 'wake_batch'")
        Signed-off-by: Li Lingfeng <lilingfe...@huawei.com>
        Reviewed-by: Ming Lei <ming....@redhat.com>
        Reviewed-by: Yu Kuai <yuku...@huawei.com>
        Reviewed-by: Bart Van Assche <bvanass...@acm.org>
        Link: https://lore.kernel.org/r/20240815024736.2040971-1-lilingf...@huaweicloud.com
        Signed-off-by: Jens Axboe <ax...@kernel.dk>
    
    Signed-off-by: Ming Lei <ming....@redhat.com>
    
      PID: 408      TASK: ffff8eee0870ca00  CPU: 36   COMMAND: "kworker/36:1H"
       #0 [fffffe3861831e60] crash_nmi_callback at ffffffff97269e31
       #1 [fffffe3861831e68] nmi_handle at ffffffff972300bb
       #2 [fffffe3861831eb0] default_do_nmi at ffffffff97e9e000
       #3 [fffffe3861831ed0] exc_nmi at ffffffff97e9e211
       #4 [fffffe3861831ef0] end_repeat_nmi at ffffffff98001639
          [exception RIP: native_queued_spin_lock_slowpath+638]
          RIP: ffffffff97eb31ae  RSP: ffffb1c8cd2a4d40  RFLAGS: 00000046
          RAX: 0000000000000000  RBX: ffff8f2dffb34780  RCX: 0000000000940000
          RDX: 000000000000002a  RSI: 0000000000ac0000  RDI: ffff8eaed4eb81c0
          RBP: ffff8eaed4eb81c0   R8: 0000000000000000   R9: ffff8f2dffaf3438
          R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
          R13: 0000000000000024  R14: 0000000000000000  R15: ffffd1c8bfb24b80
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #5 [ffffb1c8cd2a4d40] native_queued_spin_lock_slowpath at ffffffff97eb31ae
       #6 [ffffb1c8cd2a4d60] _raw_spin_lock_irqsave at ffffffff97eb2730
       #7 [ffffb1c8cd2a4d70] __wake_up at ffffffff9737c02d
       #8 [ffffb1c8cd2a4da0] sbitmap_queue_wake_up at ffffffff9786c74d
       #9 [ffffb1c8cd2a4dc8] sbitmap_queue_clear at ffffffff9786cc97
      #10 [ffffb1c8cd2a4de8] __blk_mq_free_request at ffffffff977c6049
      #11 [ffffb1c8cd2a4e20] scsi_end_request at ffffffff97a6e821
      #12 [ffffb1c8cd2a4e50] scsi_io_completion at ffffffff97a6f796
      #13 [ffffb1c8cd2a4e88] _scsih_io_done at ffffffffc0d80f7f [mpt3sas]
      #14 [ffffb1c8cd2a4ee0] _base_process_reply_queue at ffffffffc0d7124b [mpt3sas]
      #15 [ffffb1c8cd2a4f38] _base_interrupt at ffffffffc0d7183b [mpt3sas]
      #16 [ffffb1c8cd2a4f40] __handle_irq_event_percpu at ffffffff9739bb3a
      #17 [ffffb1c8cd2a4f78] handle_irq_event at ffffffff9739bd88
      #18 [ffffb1c8cd2a4fa8] handle_edge_irq at ffffffff973a0dc3
      #19 [ffffb1c8cd2a4fc8] __common_interrupt at ffffffff9722dace
      #20 [ffffb1c8cd2a4ff0] common_interrupt at ffffffff97e9db0b
      --- <IRQ stack> ---
      #21 [ffffb1c8cd90fc68] asm_common_interrupt at ffffffff98000d62
          [exception RIP: _raw_spin_unlock_irq+20]
          RIP: ffffffff97eb2e84  RSP: ffffb1c8cd90fd18  RFLAGS: 00000283
          RAX: 0000000000000001  RBX: ffff8eafb68efb40  RCX: 0000000000000001
          RDX: 0000000000000008  RSI: 0000000000000061  RDI: ffff8eafb06c3c70
          RBP: ffff8eee7af43000   R8: ffff8eaed4eb81c8   R9: ffff8eaed4eb81c8
          R10: 0000000000000008  R11: 0000000000000008  R12: 0000000000000000
          R13: ffff8eafb06c3bd0  R14: ffff8eafb06c3bc0  R15: ffff8eaed4eb81c0
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      #22 [ffffb1c8cd90fd18] __blk_mq_get_driver_tag at ffffffff977c9431
      #23 [ffffb1c8cd90fd48] blk_mq_mark_tag_wait at ffffffff977ca58e
      #24 [ffffb1c8cd90fd98] blk_mq_dispatch_rq_list at ffffffff977ca737
      #25 [ffffb1c8cd90fe20] __blk_mq_sched_dispatch_requests at ffffffff977d049b
      #26 [ffffb1c8cd90fe60] blk_mq_sched_dispatch_requests at ffffffff977d05f3
      #27 [ffffb1c8cd90fe70] blk_mq_run_work_fn at ffffffff977c5416
      #28 [ffffb1c8cd90fe90] process_one_work at ffffffff9732f092
      #29 [ffffb1c8cd90fed8] worker_thread at ffffffff9732f660
      #30 [ffffb1c8cd90ff18] kthread at ffffffff97337b0d
      #31 [ffffb1c8cd90ff50] ret_from_fork at ffffffff97202c69
    
    __wake_up() is waiting on spin_lock_irqsave(&wq_head->lock). The lock is
    in RBP == ffff8eaed4eb81c0.
    
    But this spinlock has already been acquired by the same process before
    the IRQ stack:
      blk_mq_mark_tag_wait() -> spin_lock_irq(&wq->lock) (R15 == ffff8eaed4eb81c0)
    
    Note the _irq() variant: interrupts should be disabled here, so how can
    the IRQ arrive before spin_unlock_irq() is called?
    
    This is because IRQs were re-enabled in the following call chain:
    
      +-> blk_mq_mark_tag_wait
          +-> spin_lock_irq(&wq->lock);
          +-> blk_mq_get_driver_tag
            +-> __blk_mq_get_driver_tag
              +-> __blk_mq_alloc_driver_tag
                +-> blk_mq_tag_busy
                  +-> __blk_mq_tag_busy
                    +-> spin_unlock_irq(&tags->lock)
    
    And this is fixed by the current patch, which uses spin_lock_irqsave()
    and spin_unlock_irqrestore() instead.
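
    In C terms, the broken sequence looks roughly like this (a simplified
    sketch of the call flow above, not verbatim kernel code):

      /* blk_mq_mark_tag_wait() */
      spin_lock_irq(&wq->lock);        /* IRQs off, wq->lock (lock A) held */

      /* ...nested: __blk_mq_tag_busy() */
      spin_lock_irq(&tags->lock);      /* lock B */
      spin_unlock_irq(&tags->lock);    /* unlock B, re-enables IRQs even
                                          though lock A is still held */

      /* window: an IO completion IRQ on this CPU can now call __wake_up(),
         which spins on wq->lock forever -- the lockup captured above */
      spin_unlock_irq(&wq->lock);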
    
    https://virtuozzo.atlassian.net/browse/ACSSP-444
    https://virtuozzo.atlassian.net/browse/VSTOR-108134
    (cherry picked from CentOS9 Stream commit 1b23028947e6e2db982b13ce0bd4150726337104)
    
    Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
    Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
    
    Feature: fix ms/fs
---
 block/blk-mq-tag.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cc57e2dd9a0bb..2cafcf11ee8be 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -38,6 +38,7 @@ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags,
 void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
        unsigned int users;
+       unsigned long flags;
        struct blk_mq_tags *tags = hctx->tags;
 
        /*
@@ -56,11 +57,11 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
                        return;
        }
 
-       spin_lock_irq(&tags->lock);
+       spin_lock_irqsave(&tags->lock, flags);
        users = tags->active_queues + 1;
        WRITE_ONCE(tags->active_queues, users);
        blk_mq_update_wake_batch(tags, users);
-       spin_unlock_irq(&tags->lock);
+       spin_unlock_irqrestore(&tags->lock, flags);
 }
 
 /*