The commit is pushed to "branch-rh9-5.14.0-427.44.1.vz9.80.x-ovz" and will appear
at g...@bitbucket.org:openvz/vzkernel.git after rh9-5.14.0-427.44.1.vz9.80.38

------>
commit 8cb7bdaed670b0e3333c47d9247a512517e16940
Author: Ming Lei <ming....@redhat.com>
Date: Thu Aug 15 10:47:36 2024 +0800
ms/block: Fix lockdep warning in blk_mq_mark_tag_wait

JIRA: https://issues.redhat.com/browse/RHEL-56837

ms commit b313a8c835516bdda85025500be866ac8a74e022
Author: Li Lingfeng <lilingfe...@huawei.com>
Date: Thu Aug 15 10:47:36 2024 +0800

block: Fix lockdep warning in blk_mq_mark_tag_wait

Lockdep reported a warning in Linux version 6.6:

[ 414.344659] ================================
[ 414.345155] WARNING: inconsistent lock state
[ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[ 414.346221] --------------------------------
[ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[ 414.351204] {IN-SOFTIRQ-W} state was registered at:
[ 414.351751] lock_acquire+0x18d/0x460
[ 414.352218] _raw_spin_lock_irqsave+0x39/0x60
[ 414.352769] __wake_up_common_lock+0x22/0x60
[ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0
[ 414.353829] sbitmap_queue_clear+0xdd/0x270
[ 414.354338] blk_mq_put_tag+0xdf/0x170
[ 414.354807] __blk_mq_free_request+0x381/0x4d0
[ 414.355335] blk_mq_free_request+0x28b/0x3e0
[ 414.355847] __blk_mq_end_request+0x242/0xc30
[ 414.356367] scsi_end_request+0x2c1/0x830
[ 414.356863] scsi_io_completion+0x177/0x1610
[ 414.357379] scsi_complete+0x12f/0x260
[ 414.357856] blk_complete_reqs+0xba/0xf0
[ 414.358338] __do_softirq+0x1b0/0x7a2
[ 414.358796] irq_exit_rcu+0x14b/0x1a0
[ 414.359262] sysvec_call_function_single+0xaf/0xc0
[ 414.359828] asm_sysvec_call_function_single+0x1a/0x20
[ 414.360426] default_idle+0x1e/0x30
[ 414.360873] default_idle_call+0x9b/0x1f0
[ 414.361390] do_idle+0x2d2/0x3e0
[ 414.361819] cpu_startup_entry+0x55/0x60
[ 414.362314] start_secondary+0x235/0x2b0
[ 414.362809] secondary_startup_64_no_verify+0x18f/0x19b
[ 414.363413] irq event stamp: 428794
[ 414.363825] hardirqs last enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[ 414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[ 414.365629] softirqs last enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[ 414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[ 414.367425] other info that might help us debug this:
[ 414.368194] Possible unsafe locking scenario:
[ 414.368900]       CPU0
[ 414.369225]       ----
[ 414.369548]  lock(&sbq->ws[i].wait);
[ 414.370000]  <Interrupt>
[ 414.370342]    lock(&sbq->ws[i].wait);
[ 414.370802] *** DEADLOCK ***
[ 414.371569] 5 locks held by kworker/u10:3/1152:
[ 414.372088] #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[ 414.373180] #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[ 414.374384] #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[ 414.375342] #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[ 414.376377] #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[ 414.378607] stack backtrace:
[ 414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[ 414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[ 414.381805] Call Trace:
[ 414.382136] <TASK>
[ 414.382429] dump_stack_lvl+0x91/0xf0
[ 414.382884] mark_lock_irq+0xb3b/0x1260
[ 414.383367] ? __pfx_mark_lock_irq+0x10/0x10
[ 414.383889] ? stack_trace_save+0x8e/0xc0
[ 414.384373] ? __pfx_stack_trace_save+0x10/0x10
[ 414.384903] ? graph_lock+0xcf/0x410
[ 414.385350] ? save_trace+0x3d/0xc70
[ 414.385808] mark_lock.part.20+0x56d/0xa90
[ 414.386317] mark_held_locks+0xb0/0x110
[ 414.386791] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 414.387320] lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.387901] ? _raw_spin_unlock_irq+0x28/0x50
[ 414.388422] trace_hardirqs_on+0x58/0x100
[ 414.388917] _raw_spin_unlock_irq+0x28/0x50
[ 414.389422] __blk_mq_tag_busy+0x1d6/0x2a0
[ 414.389920] __blk_mq_get_driver_tag+0x761/0x9f0
[ 414.390899] blk_mq_dispatch_rq_list+0x1780/0x1ee0
[ 414.391473] ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[ 414.392070] ? sbitmap_get+0x2b8/0x450
[ 414.392533] ? __blk_mq_get_driver_tag+0x210/0x9f0
[ 414.393095] __blk_mq_sched_dispatch_requests+0xd99/0x1690
[ 414.393730] ? elv_attempt_insert_merge+0x1b1/0x420
[ 414.394302] ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[ 414.394970] ? lock_acquire+0x18d/0x460
[ 414.395456] ? blk_mq_run_hw_queue+0x637/0xa00
[ 414.395986] ? __pfx_lock_acquire+0x10/0x10
[ 414.396499] blk_mq_sched_dispatch_requests+0x109/0x190
[ 414.397100] blk_mq_run_hw_queue+0x66e/0xa00
[ 414.397616] blk_mq_flush_plug_list.part.17+0x614/0x2030
[ 414.398244] ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[ 414.398897] ? writeback_sb_inodes+0x241/0xcc0
[ 414.399429] blk_mq_flush_plug_list+0x65/0x80
[ 414.399957] __blk_flush_plug+0x2f1/0x530
[ 414.400458] ? __pfx___blk_flush_plug+0x10/0x10
[ 414.400999] blk_finish_plug+0x59/0xa0
[ 414.401467] wb_writeback+0x7cc/0x920
[ 414.401935] ? __pfx_wb_writeback+0x10/0x10
[ 414.402442] ? mark_held_locks+0xb0/0x110
[ 414.402931] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 414.403462] ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.404062] wb_workfn+0x2b3/0xcf0
[ 414.404500] ? __pfx_wb_workfn+0x10/0x10
[ 414.404989] process_scheduled_works+0x432/0x13f0
[ 414.405546] ? __pfx_process_scheduled_works+0x10/0x10
[ 414.406139] ? do_raw_spin_lock+0x101/0x2a0
[ 414.406641] ? assign_work+0x19b/0x240
[ 414.407106] ? lock_is_held_type+0x9d/0x110
[ 414.407604] worker_thread+0x6f2/0x1160
[ 414.408075] ? __kthread_parkme+0x62/0x210
[ 414.408572] ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.409168] ? __kthread_parkme+0x13c/0x210
[ 414.409678] ? __pfx_worker_thread+0x10/0x10
[ 414.410191] kthread+0x33c/0x440
[ 414.410602] ? __pfx_kthread+0x10/0x10
[ 414.411068] ret_from_fork+0x4d/0x80
[ 414.411526] ? __pfx_kthread+0x10/0x10
[ 414.411993] ret_from_fork_asm+0x1b/0x30
[ 414.412489] </TASK>

When interrupts are turned on while a lock is held via spin_lock_irq(),
lockdep throws a warning because of a potential deadlock.

blk_mq_prep_dispatch_rq
 blk_mq_get_driver_tag
  __blk_mq_get_driver_tag
   __blk_mq_alloc_driver_tag
    blk_mq_tag_busy -> tag is already busy
   // failed to get driver tag
 blk_mq_mark_tag_wait
  spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
  __add_wait_queue(wq, wait) -> wait queue active
  blk_mq_get_driver_tag
  __blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
   spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
   spin_unlock_irq(&tags->lock) -> unlock B, turns interrupts back on accidentally
-> 2) context must be preempted by an IO interrupt to trigger the deadlock.

As shown above, the deadlock is not possible in theory, but the warning
still needs to be fixed.

Fix it by using spin_lock_irqsave() to take lock B instead of spin_lock_irq().

Fixes: 4f1731df60f9 ("blk-mq: fix potential io hang by wrong 'wake_batch'")
Signed-off-by: Li Lingfeng <lilingfe...@huawei.com>
Reviewed-by: Ming Lei <ming....@redhat.com>
Reviewed-by: Yu Kuai <yuku...@huawei.com>
Reviewed-by: Bart Van Assche <bvanass...@acm.org>
Link: https://lore.kernel.org/r/20240815024736.2040971-1-lilingf...@huaweicloud.com
Signed-off-by: Jens Axboe <ax...@kernel.dk>
Signed-off-by: Ming Lei <ming....@redhat.com>
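For illustration only, here is a minimal, self-contained sketch of the locking
pattern described in the quoted commit above; the identifiers (demo_wq,
demo_tags_lock, the demo_* functions) are hypothetical stand-ins, not the real
block layer code. It shows how a nested helper built on
spin_lock_irq()/spin_unlock_irq() silently re-enables interrupts inside the
caller's IRQ-off section, opening the window in which the completion interrupt
can spin on the wait-queue lock the task already holds -- the hang captured in
the crash dump below.

/* Minimal sketch, kernel-module style; names are hypothetical. */
#include <linux/spinlock.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);   /* plays the role of &sbq->ws[i].wait (lock A) */
static DEFINE_SPINLOCK(demo_tags_lock);    /* plays the role of &tags->lock (lock B) */

/* Completion side: runs in hard-IRQ context, like the SCSI IRQ path above. */
static void demo_completion_irq(void)
{
        /*
         * wake_up() ends up in __wake_up(), which takes demo_wq.lock with
         * spin_lock_irqsave().  If the interrupted task on this CPU already
         * owns that lock, this spins forever.
         */
        wake_up(&demo_wq);
}

/* Buggy helper: the shape of __blk_mq_tag_busy() before the fix. */
static void demo_tag_busy_buggy(void)
{
        spin_lock_irq(&demo_tags_lock);
        /* ... accounting update ... */
        spin_unlock_irq(&demo_tags_lock);  /* unconditionally re-enables IRQs */
}

/* Submission side: the shape of blk_mq_mark_tag_wait(). */
static void demo_mark_tag_wait(void)
{
        spin_lock_irq(&demo_wq.lock);      /* lock A taken, IRQs now off */

        /*
         * From the helper's return onward IRQs are on again, so
         * demo_completion_irq() may run on this CPU and deadlock on lock A.
         */
        demo_tag_busy_buggy();

        spin_unlock_irq(&demo_wq.lock);
}

Replacing the helper's spin_lock_irq()/spin_unlock_irq() pair with
spin_lock_irqsave()/spin_unlock_irqrestore() keeps interrupts disabled across
the whole outer critical section, which is what the patch below does in
__blk_mq_tag_busy().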
PID: 408 TASK: ffff8eee0870ca00 CPU: 36 COMMAND: "kworker/36:1H"
 #0 [fffffe3861831e60] crash_nmi_callback at ffffffff97269e31
 #1 [fffffe3861831e68] nmi_handle at ffffffff972300bb
 #2 [fffffe3861831eb0] default_do_nmi at ffffffff97e9e000
 #3 [fffffe3861831ed0] exc_nmi at ffffffff97e9e211
 #4 [fffffe3861831ef0] end_repeat_nmi at ffffffff98001639
    [exception RIP: native_queued_spin_lock_slowpath+638]
    RIP: ffffffff97eb31ae RSP: ffffb1c8cd2a4d40 RFLAGS: 00000046
    RAX: 0000000000000000 RBX: ffff8f2dffb34780 RCX: 0000000000940000
    RDX: 000000000000002a RSI: 0000000000ac0000 RDI: ffff8eaed4eb81c0
    RBP: ffff8eaed4eb81c0 R8: 0000000000000000 R9: ffff8f2dffaf3438
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000024 R14: 0000000000000000 R15: ffffd1c8bfb24b80
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
 #5 [ffffb1c8cd2a4d40] native_queued_spin_lock_slowpath at ffffffff97eb31ae
 #6 [ffffb1c8cd2a4d60] _raw_spin_lock_irqsave at ffffffff97eb2730
 #7 [ffffb1c8cd2a4d70] __wake_up at ffffffff9737c02d
 #8 [ffffb1c8cd2a4da0] sbitmap_queue_wake_up at ffffffff9786c74d
 #9 [ffffb1c8cd2a4dc8] sbitmap_queue_clear at ffffffff9786cc97
#10 [ffffb1c8cd2a4de8] __blk_mq_free_request at ffffffff977c6049
#11 [ffffb1c8cd2a4e20] scsi_end_request at ffffffff97a6e821
#12 [ffffb1c8cd2a4e50] scsi_io_completion at ffffffff97a6f796
#13 [ffffb1c8cd2a4e88] _scsih_io_done at ffffffffc0d80f7f [mpt3sas]
#14 [ffffb1c8cd2a4ee0] _base_process_reply_queue at ffffffffc0d7124b [mpt3sas]
#15 [ffffb1c8cd2a4f38] _base_interrupt at ffffffffc0d7183b [mpt3sas]
#16 [ffffb1c8cd2a4f40] __handle_irq_event_percpu at ffffffff9739bb3a
#17 [ffffb1c8cd2a4f78] handle_irq_event at ffffffff9739bd88
#18 [ffffb1c8cd2a4fa8] handle_edge_irq at ffffffff973a0dc3
#19 [ffffb1c8cd2a4fc8] __common_interrupt at ffffffff9722dace
#20 [ffffb1c8cd2a4ff0] common_interrupt at ffffffff97e9db0b
--- <IRQ stack> ---
#21 [ffffb1c8cd90fc68] asm_common_interrupt at ffffffff98000d62
    [exception RIP: _raw_spin_unlock_irq+20]
    RIP: ffffffff97eb2e84 RSP: ffffb1c8cd90fd18 RFLAGS: 00000283
    RAX: 0000000000000001 RBX: ffff8eafb68efb40 RCX: 0000000000000001
    RDX: 0000000000000008 RSI: 0000000000000061 RDI: ffff8eafb06c3c70
    RBP: ffff8eee7af43000 R8: ffff8eaed4eb81c8 R9: ffff8eaed4eb81c8
    R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
    R13: ffff8eafb06c3bd0 R14: ffff8eafb06c3bc0 R15: ffff8eaed4eb81c0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#22 [ffffb1c8cd90fd18] __blk_mq_get_driver_tag at ffffffff977c9431
#23 [ffffb1c8cd90fd48] blk_mq_mark_tag_wait at ffffffff977ca58e
#24 [ffffb1c8cd90fd98] blk_mq_dispatch_rq_list at ffffffff977ca737
#25 [ffffb1c8cd90fe20] __blk_mq_sched_dispatch_requests at ffffffff977d049b
#26 [ffffb1c8cd90fe60] blk_mq_sched_dispatch_requests at ffffffff977d05f3
#27 [ffffb1c8cd90fe70] blk_mq_run_work_fn at ffffffff977c5416
#28 [ffffb1c8cd90fe90] process_one_work at ffffffff9732f092
#29 [ffffb1c8cd90fed8] worker_thread at ffffffff9732f660
#30 [ffffb1c8cd90ff18] kthread at ffffffff97337b0d
#31 [ffffb1c8cd90ff50] ret_from_fork at ffffffff97202c69

__wake_up() is waiting on spin_lock_irqsave(&wq_head->lock). The lock is in
RBP == ffff8eaed4eb81c0. But this spinlock has already been acquired by the
same process before the IRQ stack:

blk_mq_mark_tag_wait() -> spin_lock_irq(&wq->lock) (R15 == ffff8eaed4eb81c0)

Note the _irq() variant: how could the IRQ fire before spin_unlock_irq()?
This is because IRQs were re-enabled in the following call chain:

+-> blk_mq_mark_tag_wait
  +-> spin_lock_irq(&wq->lock);
  +-> blk_mq_get_driver_tag
    +-> blk_mq_get_driver_tag
      +-> __blk_mq_get_driver_tag
        +-> __blk_mq_alloc_driver_tag
          +-> blk_mq_tag_busy
            +-> __blk_mq_tag_busy
              +-> spin_unlock_irq(&tags->lock)

And this is what the current patch fixes: __blk_mq_tag_busy() now uses
spin_lock_irqsave()/spin_unlock_irqrestore() there instead, so it no longer
re-enables interrupts behind the caller's back.

https://virtuozzo.atlassian.net/browse/ACSSP-444
https://virtuozzo.atlassian.net/browse/VSTOR-108134

(cherry picked from CentOS9 Stream commit 1b23028947e6e2db982b13ce0bd4150726337104)
Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

Feature: fix ms/fs
---
 block/blk-mq-tag.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cc57e2dd9a0bb..2cafcf11ee8be 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -38,6 +38,7 @@ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags,
 void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
        unsigned int users;
+       unsigned long flags;
        struct blk_mq_tags *tags = hctx->tags;
 
        /*
@@ -56,11 +57,11 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
                return;
        }
 
-       spin_lock_irq(&tags->lock);
+       spin_lock_irqsave(&tags->lock, flags);
        users = tags->active_queues + 1;
        WRITE_ONCE(tags->active_queues, users);
        blk_mq_update_wake_batch(tags, users);
-       spin_unlock_irq(&tags->lock);
+       spin_unlock_irqrestore(&tags->lock, flags);
 }
 
 /*
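A brief note on why this shape of fix composes safely: the irqsave/irqrestore
pair saves and restores the caller's interrupt state instead of unconditionally
enabling IRQs, so the helper behaves correctly whether it is entered with
interrupts on or off. A minimal standalone sketch under assumed, hypothetical
names (outer_lock, inner_lock, helper_irqsave), not the patched function itself:

#include <linux/bug.h>
#include <linux/irqflags.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(outer_lock);     /* caller's lock, taken with IRQs off */
static DEFINE_SPINLOCK(inner_lock);     /* helper's lock, like &tags->lock */

/* Patched-style helper: preserves whatever IRQ state the caller had. */
static void helper_irqsave(void)
{
        unsigned long flags;

        spin_lock_irqsave(&inner_lock, flags);
        /* ... critical section ... */
        spin_unlock_irqrestore(&inner_lock, flags);
        /* If the caller had IRQs disabled, they are still disabled here. */
}

static void caller(void)
{
        spin_lock_irq(&outer_lock);     /* IRQs off for the whole section */

        helper_irqsave();
        WARN_ON_ONCE(!irqs_disabled()); /* holds with the irqsave variant */

        spin_unlock_irq(&outer_lock);   /* IRQs come back on only here */
}

With the pre-fix spin_lock_irq()/spin_unlock_irq() pair inside the helper,
irqs_disabled() would be false after the call, which is the root of the
inconsistency lockdep reported above.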