[patch added to 3.12-stable] workqueue: fix ghost PENDING flag while doing MQ IO

2016-05-02 Thread Jiri Slaby
From: Roman Pen 

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===

commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.

A bug in the workqueue code leads to a stalled IO request in the MQ
ctx->rq_list, with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]   Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D 880803077988 0  1636  2 0x
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  880803077988 88080466b900 8808033f9c80 880803078000
[  601.349662]  880807c95000 7fff 815b0920 880803077ad0
[  601.350333]  8808030779a0 815b01d5  880803077a38
[  601.350965] Call Trace:
[  601.351203]  [] ? bit_wait+0x60/0x60
[  601.351444]  [] schedule+0x35/0x80
[  601.351709]  [] schedule_timeout+0x192/0x230
[  601.351958]  [] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [] ? ktime_get+0x37/0xa0
[  601.352446]  [] ? bit_wait+0x60/0x60
[  601.352688]  [] io_schedule_timeout+0xa4/0x110
[  601.352951]  [] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [] bit_wait_io+0x1b/0x70
[  601.353440]  [] __wait_on_bit+0x5d/0x90
[  601.353689]  [] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [] blkdev_fsync+0x1b/0x50
[  601.355193]  [] vfs_fsync_range+0x49/0xa0
[  601.355432]  [] blkdev_write_iter+0xca/0x100
[  601.355679]  [] __vfs_write+0xaa/0xe0
[  601.355925]  [] vfs_write+0xa9/0x1a0
[  601.356164]  [] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1
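
For reference, an explicit invocation with these parameters would look
like this (in the null_blk driver, queue_mode=2 selects MQ; parameter
names as in the in-tree module):

  modprobe null_blk queue_mode=2 submit_queues=1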

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
8838038e2400
...

During debugging it became clear that the stalled request is always
inserted into the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() ultimately offloads
execution of __blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it
calls kblockd_schedule_delayed_work_on().

That means we race with another CPU, which may already be about to
execute the __blk_mq_run_hw_queue() work.
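
A simplified sketch of that handoff (paraphrased and abridged from the
4.4-era block layer, not the verbatim source) looks like this:

	/* Run the hw queue: inline when safe, otherwise punt to kblockd. */
	static void blk_mq_run_hw_queue_sketch(struct blk_mq_hw_ctx *hctx,
					       bool async)
	{
		if (!async) {
			/* Not called from the scheduler: run inline. */
			__blk_mq_run_hw_queue(hctx);
			return;
		}
		/*
		 * from_schedule == true: hand the run over to kblockd.
		 * The underlying queue_work machinery returns 0 when the
		 * work's PENDING bit is already set, i.e. some CPU has
		 * claimed the work and is expected to observe the freshly
		 * marked ctx_map bit.
		 */
		kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
						 &hctx->run_work, 0);
	}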

Further debugging shows the following traces from different CPUs:

  CPU#0                                CPU#1
  -----------------------------------  -----------------------------------
  request A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                       request B inserted
                                       STORE hctx->ctx_map[1] bit marked
                                       kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result, request B remained pending forever.

This behaviour can be explained by a speculative LOAD of hctx->ctx_map
on CPU#0, which is reordered with the clear of the PENDING bit and
executed _before_ the actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full memory barrier (the smp_mb() added
below), which guarantees that the clear of the PENDING bit is executed
before any possible speculative LOADs or STOREs inside the actual work
function.
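
The pattern can be modelled outside the kernel. The following minimal
user-space sketch uses C11 atomics; the names (pending, ctx_bit,
try_queue, run_work) are illustrative stand-ins for the kernel's
machinery, not its real identifiers:

	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_bool pending;   /* stands in for WORK_STRUCT_PENDING */
	static atomic_bool ctx_bit;   /* stands in for a hctx->ctx_map bit */

	/* Submitter (CPU#1 above): publish work, then try to claim PENDING.
	 * A false return means another CPU holds PENDING and is expected
	 * to observe ctx_bit. */
	static bool try_queue(void)
	{
		atomic_store(&ctx_bit, true);            /* STORE the bit  */
		return !atomic_exchange(&pending, true); /* test_and_set() */
	}

	/* Worker (CPU#0 above): release PENDING, then scan for work. */
	static void run_work(void)
	{
		atomic_store_explicit(&pending, false, memory_order_relaxed);
		/* This full fence plays the role of the smp_mb() the patch
		 * adds: without it, the LOAD of ctx_bit below may be
		 * reordered before the clear of pending, and a bit set by
		 * a concurrent try_queue() that still saw pending == true
		 * is never noticed -- the "ghost PENDING" stall. */
		atomic_thread_fence(memory_order_seq_cst);
		if (atomic_load_explicit(&ctx_bit, memory_order_relaxed)) {
			atomic_store(&ctx_bit, false);
			/* ... process published work here ... */
		}
	}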

Signed-off-by: Roman Pen 
Cc: Gioh Kim 
Cc: Michael Wang 
Cc: Tejun Heo 
Cc: Jens Axboe 
Cc: linux-bl...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo 
Signed-off-by: Jiri Slaby 
---
 kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bb5f920268d7..2bc1257e420f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -622,6 +622,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	/*
+	 * The following mb guarantees that previous clear of a PENDING bit
+	 * will not be reordered