Re: [PATCH v3 0/2] scsi: mpt3sas: prevent oops in the shutdown/unload path

2018-03-05 Thread Mauricio Faria de Oliveira

Martin, James,

On 02/22/2018 01:07 AM, Martin K. Petersen wrote:

The first patch prevents the SCSI error handlers to run once the
shutdown/unload path starts.  This avoids an oops at least in the host
reset handler, on kernels with a recent patch, and also in the abort
handler on kernels without that patch.

The second patch goes a step further, and prevents the SCSI error
handlers to be called for SCSI IO commands running at the
shutdown/unload time (in case adapter interrupts are disabled before
they complete) by finishing them.

Notice the second patch only affects SCSI IO commands. Other command
types, e.g., Task Manangement Functions, which are running at the
shutdown/unload time still must be handled separately -- this is also
covered in patch 1.



Applied to 4.16/scsi-fixes. Thank you!


Thank you.

Are you OK with / would ack this patchset for stable (v4.14+) ?

I have the backports ready and can submit if you are OK with it.

Thanks,
Mauricio



[PATCH v2 1/2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload

2018-02-16 Thread Mauricio Faria de Oliveira
This patch adds checks for 'ioc->remove_host' in the SCSI error
handlers, so not to access pointers/resources potentially freed
in the PCI shutdown/module unload path.  The error handlers may
be invoked after shutdown/unload, depending on other components.

This problem was observed with kexec on a system with a mpt3sas
based adapter and an infiniband adapter which takes long enough
to shutdown:

The mpt3sas driver finished shutting down / disabled interrupt
handling, thus some commands have not finished and timed out.

Since the system was still running (waiting for the infiniband
adapter to shutdown), the scsi error handler for task abort of
mpt3sas was invoked, and hit an oops -- either in scsih_abort()
because 'ioc->scsi_lookup' was NULL (without commit dbec4c90
"scsi: mpt3sas: lockless command submission"), or later up in
scsih_host_reset() (with or without that commit), because it
eventually called mpt3sas_base_get_iocstate().

After commit dbec4c90, the oops in scsih_abort() does not occur
anymore (_scsih_scsi_lookup_find_by_scmd() is no longer called),
but that commit is too big and out of the scope of linux-stable,
where this patch might help, so still go for the changes.

Also, this might help to prevent similar errors in the future,
in case code changes and possibly tries to access freed stuff.

Note the fix in scsih_host_reset() is still important anyway.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 74fca18..5ab3caf 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -2835,7 +2835,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
_scsih_tm_display_info(ioc, scmd);
 
sas_device_priv_data = scmd->device->hostdata;
-   if (!sas_device_priv_data || !sas_device_priv_data->sas_target) {
+   if (!sas_device_priv_data || !sas_device_priv_data->sas_target ||
+   ioc->remove_host) {
sdev_printk(KERN_INFO, scmd->device,
"device been deleted! scmd(%p)\n", scmd);
scmd->result = DID_NO_CONNECT << 16;
@@ -2898,7 +2899,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
_scsih_tm_display_info(ioc, scmd);
 
sas_device_priv_data = scmd->device->hostdata;
-   if (!sas_device_priv_data || !sas_device_priv_data->sas_target) {
+   if (!sas_device_priv_data || !sas_device_priv_data->sas_target ||
+   ioc->remove_host) {
sdev_printk(KERN_INFO, scmd->device,
"device been deleted! scmd(%p)\n", scmd);
scmd->result = DID_NO_CONNECT << 16;
@@ -2961,7 +2963,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
_scsih_tm_display_info(ioc, scmd);
 
sas_device_priv_data = scmd->device->hostdata;
-   if (!sas_device_priv_data || !sas_device_priv_data->sas_target) {
+   if (!sas_device_priv_data || !sas_device_priv_data->sas_target ||
+   ioc->remove_host) {
starget_printk(KERN_INFO, starget, "target been deleted! 
scmd(%p)\n",
scmd);
scmd->result = DID_NO_CONNECT << 16;
@@ -3019,7 +3022,7 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
ioc->name, scmd);
scsi_print_command(scmd);
 
-   if (ioc->is_driver_loading) {
+   if (ioc->is_driver_loading || ioc->remove_host) {
pr_info(MPT3SAS_FMT "Blocking the host reset\n",
ioc->name);
r = FAILED;
-- 
1.8.3.1



[PATCH v3 0/2] scsi: mpt3sas: prevent oops in the shutdown/unload path

2018-02-16 Thread Mauricio Faria de Oliveira
The first patch prevents the SCSI error handlers to run once the shutdown/unload
path starts.  This avoids an oops at least in the host reset handler, on kernels
with a recent patch, and also in the abort handler on kernels without that 
patch.

The second patch goes a step further, and prevents the SCSI error handlers to be
called for SCSI IO commands running at the shutdown/unload time (in case adapter
interrupts are disabled before they complete) by finishing them.

Notice the second patch only affects SCSI IO commands. Other command types, 
e.g.,
Task Manangement Functions, which are running at the shutdown/unload time still
must be handled separately -- this is also covered in patch 1.

The patchset is v3 as PATCH 1/2 is identical to its v2 submission, and PATCH 2/2
has been added (in its first version).

Thanks to Sreekanth Reddy and Bart Van Assche for review and contributions.

Mauricio Faria de Oliveira (2):
  scsi: mpt3sas: fix oops in error handlers after shutdown/unload
  scsi: mpt3sas: wait for and flush running commands on shutdown/unload

 drivers/scsi/mpt3sas/mpt3sas_base.c  |  8 
 drivers/scsi/mpt3sas/mpt3sas_base.h  |  3 +++
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 21 -
 3 files changed, 23 insertions(+), 9 deletions(-)

-- 
1.8.3.1



[PATCH 2/2] scsi: mpt3sas: wait for and flush running commands on shutdown/unload

2018-02-16 Thread Mauricio Faria de Oliveira
From: Sreekanth Reddy <sreekanth.re...@broadcom.com>

This patch finishes all outstanding SCSI IO commands (but not other
commands, e.g., task management) in the shutdown and unload paths.

It first waits for the commands to complete (this is done after
setting 'ioc->remove_host = 1 ', which prevents new commands to
be queued) then it flushes commands that might still be running.

This avoids triggering error handling (e.g., abort command) for
all commands possibly completed by the adapter after interrupts
disabled.

Suggested-by: Sreekanth Reddy <sreekanth.re...@broadcom.com>
Tested-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Signed-off-by: Sreekanth Reddy <sreekanth.re...@broadcom.com>
[mauricfo: introduced something in commit message.]
Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/mpt3sas/mpt3sas_base.c  |  8 
 drivers/scsi/mpt3sas/mpt3sas_base.h  |  3 +++
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 10 +-
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 59a87ca..0aafbfd 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -6297,14 +6297,14 @@ void mpt3sas_base_clear_st(struct MPT3SAS_ADAPTER *ioc,
 }
 
 /**
- * _wait_for_commands_to_complete - reset controller
+ * mpt3sas_wait_for_commands_to_complete - reset controller
  * @ioc: Pointer to MPT_ADAPTER structure
  *
  * This function is waiting 10s for all pending commands to complete
  * prior to putting controller in reset.
  */
-static void
-_wait_for_commands_to_complete(struct MPT3SAS_ADAPTER *ioc)
+void
+mpt3sas_wait_for_commands_to_complete(struct MPT3SAS_ADAPTER *ioc)
 {
u32 ioc_state;
 
@@ -6377,7 +6377,7 @@ void mpt3sas_base_clear_st(struct MPT3SAS_ADAPTER *ioc,
is_fault = 1;
}
_base_reset_handler(ioc, MPT3_IOC_PRE_RESET);
-   _wait_for_commands_to_complete(ioc);
+   mpt3sas_wait_for_commands_to_complete(ioc);
_base_mask_interrupts(ioc);
r = _base_make_ioc_ready(ioc, type);
if (r)
diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h 
b/drivers/scsi/mpt3sas/mpt3sas_base.h
index 789bc42..99ccf83 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.h
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
@@ -1433,6 +1433,9 @@ void mpt3sas_base_update_missing_delay(struct 
MPT3SAS_ADAPTER *ioc,
 
 int mpt3sas_port_enable(struct MPT3SAS_ADAPTER *ioc);
 
+void
+mpt3sas_wait_for_commands_to_complete(struct MPT3SAS_ADAPTER *ioc);
+
 
 /* scsih shared API */
 struct scsi_cmnd *mpt3sas_scsih_scsi_lookup_get(struct MPT3SAS_ADAPTER *ioc,
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 5ab3caf..c2ea13c7 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4456,7 +4456,7 @@ static int _scsih_set_satl_pending(struct scsi_cmnd 
*scmd, bool pending)
st = scsi_cmd_priv(scmd);
mpt3sas_base_clear_st(ioc, st);
scsi_dma_unmap(scmd);
-   if (ioc->pci_error_recovery)
+   if (ioc->pci_error_recovery || ioc->remove_host)
scmd->result = DID_NO_CONNECT << 16;
else
scmd->result = DID_RESET << 16;
@@ -9742,6 +9742,10 @@ static void scsih_remove(struct pci_dev *pdev)
unsigned long flags;
 
ioc->remove_host = 1;
+
+   mpt3sas_wait_for_commands_to_complete(ioc);
+   _scsih_flush_running_cmds(ioc);
+
_scsih_fw_event_cleanup_queue(ioc);
 
spin_lock_irqsave(>fw_event_lock, flags);
@@ -9818,6 +9822,10 @@ static void scsih_remove(struct pci_dev *pdev)
unsigned long flags;
 
ioc->remove_host = 1;
+
+   mpt3sas_wait_for_commands_to_complete(ioc);
+   _scsih_flush_running_cmds(ioc);
+
_scsih_fw_event_cleanup_queue(ioc);
 
spin_lock_irqsave(>fw_event_lock, flags);
-- 
1.8.3.1



Re: [PATCH v2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload

2018-02-15 Thread Mauricio Faria de Oliveira

Hi Sreekanth,

On 02/15/2018 03:48 AM, Sreekanth Reddy wrote:

During the shutdown time, I don't want the outstanding IOs to timeout due to
disabling of interrupts and go the TM path. So I wanted to clear out all the
Outstanding IOs in the shutdown path itself instead of clearing them in TM
path.


Ah. I see. I didn't realize that. Good catch.


But if there are already some TMs are outstanding while initiating the
shutdown operation then your patch looks good to me.


Ok. Thanks for the review.


May be can you include my set of changes along with your solution?


Sure. I'll send a v3 with both parts and due credit.

cheers,
Mauricio



Re: [PATCH v2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload

2018-02-14 Thread Mauricio Faria de Oliveira

Hi Sreenkanth,

Thanks for reviewing.

On 02/12/2018 04:28 AM, Sreekanth Reddy wrote:

Instead of returning the scmd with DID_NO_CONNECT in TM path, can we wait
for some time (10 seconds) in shutdown/unload path for the outstanding
commands to complete and even then [if] the scmds are outstanding then return
all the outstanding scmds with DID_NO_CONNECT in the shutdown/unload path
itself as shown below,


Apparently there is a problem with that.

That is not enough for TM commands; it's only for SCSI IO commands.

The TM commands use the high-priority queue, and SCSI IO commands use
the SCSI IO queue.  But this patch only waits for and return commands
in the latter:

> +  mpt3sas_wait_for_commands_to_complete(ioc);
> +  _scsih_flush_running_cmds(ioc);

They only loop up until 'ioc->scsiio_depth', but TM commands get SMIDs
above that [see mpt3sas_base_free_smid() and mpt3sas_scsih_issue_tm()].

So, there's an exposure for abort/reset/other TM commands. For example,
the SCSI IO commands finish quickly in wait_for_commands_to_complete(),
then the shutdown function continues, disables interrupts, and release
pointers/memory.  Then, an outstanding abort/reset command is finished,
its completion interrupt is not serviced, and the SCSI EH for it kicks
in/continues, and so it escalates up, and hit that oops in host reset.

Is there any problem with the patch that I proposed?

It seems okay to have another check in error path (not hot/performance
path), and that check is already used in many other points in the code,
for the same reasons (exit early before the code attempts to use stuff
that might be released).

Thanks again,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center







test results

2018-02-01 Thread Mauricio Faria de Oliveira
The test-case results with PATCH v2.

scsih_abort()
=

Without patch:

[  362.669743] setting logging_level(0x1000)
[  362.705074] mpt3sas_cm0: skip free_smid/scsi_done scmd(c01fd4f2bd40)
[  363.956579] sd 16:0:1:0: [sdf] Synchronizing SCSI cache
[  363.956844] sd 16:0:0:0: [sde] Synchronizing SCSI cache
[  363.957147] mpt3sas_cm0: sending diag reset !!
[  365.073568] mpt3sas_cm0: diag reset: SUCCESS
[  365.090316] mpt3sas_cm0: sleep on shutdown
[  366.120209] mpt3sas_cm0: sleep on shutdown
...
[  373.419954] sd 16:0:1:0: attempting task abort! scmd(c01fd4f2bd40)
...
[  373.420256] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress!
[  373.420300] sd 16:0:1:0: task abort: FAILED scmd(c01fd4f2bd40)
[  373.459961] sd 16:0:1:0: attempting device reset! scmd(c01fd4f2bd40)
...
[  373.460271] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress!
[  373.460315] sd 16:0:1:0: device reset: FAILED scmd(c01fd4f2bd40)
[  373.460362] scsi target16:0:1: attempting target reset! 
scmd(c01fd4f2bd40)
...
[  373.460628] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress!
[  373.460673] scsi target16:0:1: target reset: FAILED 
scmd(c01fd4f2bd40)
[  373.460720] mpt3sas_cm0: attempting host reset! scmd(c01fd4f2bd40)
[  373.460764] sd 16:0:1:0: [sdf] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 
00 01 00
[  373.460821] Unable to handle kernel paging request for data at address 
0xd0003800817d
[  373.460876] Faulting instruction address: 0xd00017b169b8
[  373.460921] Oops: Kernel access of bad area, sig: 11 [#1]
...
[  373.462028] NIP [d00017b169b8] mpt3sas_base_get_iocstate+0x38/0xb0 
[mpt3sas]
[  373.462085] LR [d00017b1a3f0] 
mpt3sas_base_hard_reset_handler+0x1a0/0x6d0 [mpt3sas]
[  373.462138] Call Trace:
[  373.462160] [c01fdf5f7ab0] [d00017b1a3f0] 
mpt3sas_base_hard_reset_handler+0x1a0/0x6d0 [mpt3sas]
[  373.462225] [c01fdf5f7b80] [d00017b20f6c] 
scsih_host_reset+0x7c/0x100 [mpt3sas]
[  373.462281] [c01fdf5f7c00] [c076f34c] 
scsi_try_host_reset+0x5c/0x140
[  373.462335] [c01fdf5f7c40] [c077107c] 
scsi_eh_ready_devs+0x73c/0x960
[  373.462390] [c01fdf5f7d10] [c0772800] 
scsi_error_handler+0x4c0/0x4e0
[  373.462444] [c01fdf5f7dc0] [c0107ed4] kthread+0x164/0x1b0
[  373.462490] [c01fdf5f7e30] [c000b5a0] 
ret_from_kernel_thread+0x5c/0xbc
[  373.468473] Instruction dump:

With patch:

[  332.994654] setting logging_level(0x1000)
[  333.024246] mpt3sas_cm0: skip free_smid/scsi_done scmd(c03c97348140)
[  334.232041] sd 16:0:1:0: [sdf] Synchronizing SCSI cache
[  334.232324] sd 16:0:0:0: [sde] Synchronizing SCSI cache
[  334.232598] mpt3sas_cm0: sending diag reset !!
[  335.352533] mpt3sas_cm0: diag reset: SUCCESS
[  335.372075] mpt3sas_cm0: sleep on shutdown
[  336.440544] mpt3sas_cm0: sleep on shutdown
...
[  343.100179] sd 16:0:1:0: attempting task abort! scmd(c03c97348140)
...
[  343.100488] sd 16:0:1:0: device been deleted! scmd(c03c97348140)
[  343.100532] sd 16:0:1:0: task abort: SUCCESS scmd(c03c97348140)
[  343.140185] sd 16:0:1:0: [sdf] tag#0 FAILED Result: 
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[  343.140275] sd 16:0:1:0: [sdf] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 
00 01 00
[  343.140338] print_req_error: I/O error, dev sdf, sector 0


scsih_dev/target/host_reset()
===

Without patch:

[  101.077946] setting logging_level(0x1100)
[  101.110029] mpt3sas_cm0: skip free_smid/scsi_done scmd(c01fb7a2dd40)
[  102.356108] sd 16:0:1:0: [sdf] Synchronizing SCSI cache
[  102.356396] sd 16:0:0:0: [sde] Synchronizing SCSI cache
[  102.356647] mpt3sas_cm0: sending diag reset !!
[  103.476380] mpt3sas_cm0: diag reset: SUCCESS
[  103.492696] mpt3sas_cm0: sleep on shutdown
[  104.512914] mpt3sas_cm0: sleep on shutdown
...
[  111.732912] sd 16:0:1:0: attempting task abort! scmd(c01fb7a2dd40)
...
[  111.733202] mpt3sas_cm0: fail task abort scmd(c01fb7a2dd40)
[  111.733246] sd 16:0:1:0: task abort: FAILED scmd(c01fb7a2dd40)
[  111.772919] sd 16:0:1:0: attempting device reset! scmd(c01fb7a2dd40)
...
[  111.773177] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress!
[  111.773222] sd 16:0:1:0: device reset: FAILED scmd(c01fb7a2dd40)
[  111.773266] scsi target16:0:1: attempting target reset! 
scmd(c01fb7a2dd40)
...
[  111.773528] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress!
[  111.773574] scsi target16:0:1: target reset: FAILED 
scmd(c01fb7a2dd40)
[  111.773617] mpt3sas_cm0: attempting host reset! scmd(c01fb7a2dd40)
[  111.773662] sd 16:0:1:0: [sdf] tag#0 CDB: 

[PATCH v2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload

2018-02-01 Thread Mauricio Faria de Oliveira
This patch adds checks for 'ioc->remove_host' in the SCSI error
handlers, so not to access pointers/resources potentially freed
in the PCI shutdown/module unload path.  The error handlers may
be invoked after shutdown/unload, depending on other components.

This problem was observed with kexec on a system with a mpt3sas
based adapter and an infiniband adapter which takes long enough
to shutdown:

The mpt3sas driver finished shutting down / disabled interrupt
handling, thus some commands have not finished and timed out.

Since the system was still running (waiting for the infiniband
adapter to shutdown), the scsi error handler for task abort of
mpt3sas was invoked, and hit an oops -- either in scsih_abort()
because 'ioc->scsi_lookup' was NULL (without commit dbec4c9040ed
("scsi: mpt3sas: lockless command submission")), or later up in
scsih_host_reset() (with or without that commit), because it
eventually called mpt3sas_base_get_iocstate().

After that commit, the oops in scsih_abort() does not occur
anymore (_scsih_scsi_lookup_find_by_scmd() is no longer called),
but that commit is too big and out of the scope of linux-stable,
where this patch might help, so still go for the changes.

Also, this might help to prevent similar errors in the future,
in case code changes and possibly tries to access freed stuff.

Note the fix in scsih_host_reset() is still important anyway.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
v2:
 - rebase on top of mkp/scsi.git's fixes branch
 - insert check for 'ioc->remove_host' in existing
   checks which already set DID_NO_CONNECT instead
   of duplicating that code. (helps with backports)
 - update commit message about that commit.

 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 74fca18..5ab3caf 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -2835,7 +2835,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
_scsih_tm_display_info(ioc, scmd);
 
sas_device_priv_data = scmd->device->hostdata;
-   if (!sas_device_priv_data || !sas_device_priv_data->sas_target) {
+   if (!sas_device_priv_data || !sas_device_priv_data->sas_target ||
+   ioc->remove_host) {
sdev_printk(KERN_INFO, scmd->device,
"device been deleted! scmd(%p)\n", scmd);
scmd->result = DID_NO_CONNECT << 16;
@@ -2898,7 +2899,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
_scsih_tm_display_info(ioc, scmd);
 
sas_device_priv_data = scmd->device->hostdata;
-   if (!sas_device_priv_data || !sas_device_priv_data->sas_target) {
+   if (!sas_device_priv_data || !sas_device_priv_data->sas_target ||
+   ioc->remove_host) {
sdev_printk(KERN_INFO, scmd->device,
"device been deleted! scmd(%p)\n", scmd);
scmd->result = DID_NO_CONNECT << 16;
@@ -2961,7 +2963,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
_scsih_tm_display_info(ioc, scmd);
 
sas_device_priv_data = scmd->device->hostdata;
-   if (!sas_device_priv_data || !sas_device_priv_data->sas_target) {
+   if (!sas_device_priv_data || !sas_device_priv_data->sas_target ||
+   ioc->remove_host) {
starget_printk(KERN_INFO, starget, "target been deleted! 
scmd(%p)\n",
scmd);
scmd->result = DID_NO_CONNECT << 16;
@@ -3019,7 +3022,7 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
ioc->name, scmd);
scsi_print_command(scmd);
 
-   if (ioc->is_driver_loading) {
+   if (ioc->is_driver_loading || ioc->remove_host) {
pr_info(MPT3SAS_FMT "Blocking the host reset\n",
ioc->name);
r = FAILED;
-- 
1.8.3.1



Re: [PATCH] test-case

2018-02-01 Thread Mauricio Faria de Oliveira

On 01/31/2018 08:50 PM, Bart Van Assche wrote:

I think it would be useful to have some variant of the above code in the kernel
tree. Are you familiar with the fault injection framework (see also
 and Documentation/fault-injection/fault-injection.txt)?


No, not yet.  That's very interesting.


Do you think that framework would be appropriate for controlling whether or not
the above code gets executed?


Yes, apparently it might be done w/ fail attributes to control the
wait-loop in the shutdown/unload hook, whether to call scsi_done(),
and whether to fail an error handler so it escalates to the next.

I'd be happy to take a look at it, if you/others are interested.
(just to make sure it does not end up rejected later because of
others' opinion. :-)

cheers,
Mauricio



Re: [PATCH] scsi: mpt3sas: fix oops in error handlers after shutdown/unload

2018-02-01 Thread Mauricio Faria de Oliveira

On 01/31/2018 08:59 PM, Bart Van Assche wrote:

On Wed, 2018-01-31 at 17:48 -0200, Mauricio Faria de Oliveira wrote:

On 01/31/2018 05:06 PM, Bart Van Assche wrote:

Sorry but I think this patch introduces new race conditions. Have you



Can you detail the race conditions? As far as I can see, the only race
condition would be when an error handler is invoked very close in time
to the shutdown/unload handler, and as such miss the 'ioc->remove_host'
assignment (thus it continues running anyway, and hit the oops).



That's indeed the race I was referring to. [snip]


Okay.  I'm not sure that point invalidates this patch; let me explain.

First, IMHO that problem already exists in mpt3sas as it is now.

There are 17 checks for 'ioc->remove_host' to return early, in the
same manner.  AFAIK, those are subject to this same race condition.

  $ grep -r 'ioc->remove_host[^=]*$' drivers/scsi/mpt3sas/ | wc -l
  17

Second, I don't think this patch makes the driver any worse.

Even with the potential race condition (which already exists elsewhere
in the driver), it reduces the chance to hit an oops from _every time_
to only when the race condition occurs _and_ the error handler looses.

So, even not being completely safe (like other parts of the driver),
it still does help.


BTW, are you aware that your patch
does not seem to apply on Martin's tree (the fixes branch of
git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git):

> [snip]

Can you rebase this patch on top of Martin's tree?


Yes, sure. I'll rebase, test, and send v2. Thanks for the pointer.

cheers,
Mauricio



Re: [PATCH] scsi: mpt3sas: fix oops in error handlers after shutdown/unload

2018-01-31 Thread Mauricio Faria de Oliveira

Bart,

Thanks for reviewing.

On 01/31/2018 05:06 PM, Bart Van Assche wrote:

Sorry but I think this patch introduces new race conditions. Have you


Can you detail the race conditions? As far as I can see, the only race
condition would be when an error handler is invoked very close in time
to the shutdown/unload handler, and as such miss the 'ioc->remove_host'
assignment (thus it continues running anyway, and hit the oops).

But this problem has never been reported, so I'd think that case would
be rare.  Not that it does not need to be perfectly solved, but _if_
that is the only problem with the patch (and _if_ I got that right),
then it would still be better than the current code in oops.

> considered to modify the mpt3sas driver such that interrupts are only
> disabled after all commands have finished?

Yes, I considered introducing waits in several points, but since the
driver is tearing down, the only point which it seem to make sense to
insert is right at the beginning of shutdown/unload -- but it seemed
not sufficient: imagine not all commands finish, which would block that
path, so we need a time out mechanism, but that does not guarantee
the call back won't occur later on (say, a while after we gave up),
so we would still need a guard at the callbacks.

I also considered checking for the particular pointer which is set
to NULL and cause the Oops, but that is a series of 10ish (small)
patches which must check 'ioc->scsi_lookup' in a spinlock, so it
seemed a bit bigger than what was needed.. and would still be prone
to hitting an oops due to another pointer at a later time.

So after thinking about those, I went with the simplest approach.

I'd be happy to consider the race conditions you thought of,
provided more detail.

thanks,
Mauricio



[PATCH] test-case

2018-01-31 Thread Mauricio Faria de Oliveira
This patch can be verified with this simple test-case,
which inserts a wait loop at the bottom of 'scsih_shutdown()'
and forces SCSI commands to timeout (skip 'scmd->scsi_done()').

It abuses the 'ioc->logging_level' parameter do to that, with:
- 0x1000: wait loop on scsih_shutdown() and skip scsi_done()
- 0x0100: force scsih_abort() to return FAILED early,
  so to run device/target/host reset.

Oops in scsih_abort()
=

Test-case:

# echo 0x1000 > /sys/module/mpt3sas/parameters/logging_level
# dd if=/dev/sdf of=/dev/null count=1 iflag=direct &
# kexec --force --append="$(cat /proc/cmdline)" 
--initrd=/boot/initrd.img-4.15.0 /boot/vmlinux-4.15.0

Without patch:

[  141.936251] setting logging_level(0x1000)
[  141.977920] mpt3sas_cm0: skip scsi_done scmd(a85f0166)
[  147.927561] sd 16:0:1:0: [sdf] Synchronizing SCSI cache
[  147.927831] sd 16:0:0:0: [sde] Synchronizing SCSI cache
[  147.928090] mpt3sas_cm0: sending diag reset !!
[  149.041346] mpt3sas_cm0: diag reset: SUCCESS
[  149.057985] mpt3sas_cm0: sleep on shutdown
[  150.098619] mpt3sas_cm0: sleep on shutdown
[  151.138571] mpt3sas_cm0: sleep on shutdown
...
[  171.938245] mpt3sas_cm0: sleep on shutdown
[  172.678231] sd 16:0:1:0: attempting task abort! 
scmd(a85f0166)
...
[  172.678545] Unable to handle kernel paging request for data at 
address 0x0008
[  172.678600] Faulting instruction address: 0xd0001789e8c0
[  172.678648] Oops: Kernel access of bad area, sig: 11 [#1]
...
[  172.679804] NIP [d0001789e8c0] scsih_abort+0xc0/0x290 [mpt3sas]
[  172.679854] LR [d0001789e8a8] scsih_abort+0xa8/0x290 [mpt3sas]
[  172.679903] Call Trace:
[  172.679926] [c01fed68fbc0] [d0001789e8a8] 
scsih_abort+0xa8/0x290 [mpt3sas] (unreliable)
[  172.679994] [c01fed68fc50] [c075a274] 
scmd_eh_abort_handler+0xc4/0x1a0
[  172.680053] [c01fed68fc90] [c00fea88] 
process_one_work+0x188/0x450
[  172.680109] [c01fed68fd20] [c00fede8] 
worker_thread+0x98/0x550
[  172.680157] [c01fed68fdc0] [c0107344] kthread+0x164/0x1b0
[  172.680206] [c01fed68fe30] [c000b6e0] 
ret_from_kernel_thread+0x5c/0x7c
[  172.680261] Instruction dump:
...

With patch:

[  233.259952] setting logging_level(0x1000)
[  233.290008] mpt3sas_cm0: skip scsi_done scmd(7ec97dda)
[  234.600934] sd 16:0:1:0: [sdf] Synchronizing SCSI cache
[  234.601222] sd 16:0:0:0: [sde] Synchronizing SCSI cache
[  234.601470] mpt3sas_cm0: sending diag reset !!
[  235.718433] mpt3sas_cm0: diag reset: SUCCESS
[  235.734534] mpt3sas_cm0: sleep on shutdown
[  236.805704] mpt3sas_cm0: sleep on shutdown
[  237.845708] mpt3sas_cm0: sleep on shutdown
...
[  263.845781] mpt3sas_cm0: sleep on shutdown
[  264.185782] sd 16:0:1:0: attempting task abort! 
scmd(7ec97dda)
...
[  264.186104] sd 16:0:1:0: task abort: SUCCESS scmd(7ec97dda)
[  264.225788] sd 16:0:1:0: [sdf] tag#0 FAILED Result: 
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[  264.225910] sd 16:0:1:0: [sdf] tag#0 CDB: Read(10) 28 00 00 00 00 00 
00 00 01 00
[  264.225969] print_req_error: I/O error, dev sdf, sector 0

Oops in scsih_host_reset()
==

Test-case:

# echo 0x1100 > /sys/module/mpt3sas/parameters/logging_level
# dd if=/dev/sdf of=/dev/null count=1 iflag=direct &
# kexec --force --append="$(cat /proc/cmdline)" 
--initrd=/boot/initrd.img-4.15.0 /boot/vmlinux-4.15.0

Without patch:

[  241.734670] setting logging_level(0x1100)
[  251.587765] mpt3sas_cm0: skip scsi_done scmd(60a524f9)
[  252.771054] sd 16:0:1:0: [sdf] Synchronizing SCSI cache
[  252.771335] sd 16:0:0:0: [sde] Synchronizing SCSI cache
[  252.771582] mpt3sas_cm0: sending diag reset !!
[  253.889275] mpt3sas_cm0: diag reset: SUCCESS
[  253.906315] mpt3sas_cm0: sleep on shutdown
[  254.966487] mpt3sas_cm0: sleep on shutdown
[  256.006417] mpt3sas_cm0: sleep on shutdown
...
[  282.005452] mpt3sas_cm0: sleep on shutdown
[  282.105416] sd 16:0:1:0: attempting task abort! 
scmd(60a524f9)
...
[  282.105707] mpt3sas_cm0: fail task abort scmd(60a524f9)
[  282.105754] sd 16:0:1:0: task abort: FAILED scmd(60a524f9)
[  282.105811] sd 16:0:1:0: attempting device reset! 
scmd(60a524f9)
...
[  282.106087] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in 
progress!
[  282.106136] sd 16:0:1:0: device reset: FAILED 

[PATCH] scsi: mpt3sas: fix oops in error handlers after shutdown/unload

2018-01-31 Thread Mauricio Faria de Oliveira
This patch adds checks for 'ioc->remove_host' in the SCSI error
handlers, so not to access pointers/resources potentially freed
in the PCI shutdown/module unload path.  The error handlers may
be invoked after shutdown/unload, depending on other components.

This problem was observed with kexec on a system with a mpt3sas
based adapter and an infiniband adapter which takes long enough
to shutdown.  The mpt3sas driver finished shuttting down, which
disabled interruption handling, thus some running commands have
not finished and timed out.  Since the system was still running,
waiting for the infiniband adapter to shut down, the scsi error
handler for task abort of mpt3sas was invoked, and hit the oops
because 'ioc->scsi_lookup' was NULL.

During patch testing, a different oops happens if host reset is
reached (when it eventually calls 'mpt3sas_base_get_iocstate()').

The device reset and target reset handlers do not cause oopses,
but print a misleading message of host reset in progress, thus
fix those too.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b258f21..3c4e47c 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -3007,6 +3007,13 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
goto out;
}
 
+   if (ioc->remove_host) {
+   scmd->result = DID_NO_CONNECT << 16;
+   scmd->scsi_done(scmd);
+   r = SUCCESS;
+   goto out;
+   }
+
/* search for the command */
smid = _scsih_scsi_lookup_find_by_scmd(ioc, scmd);
if (!smid) {
@@ -3069,6 +3076,13 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
goto out;
}
 
+   if (ioc->remove_host) {
+   scmd->result = DID_NO_CONNECT << 16;
+   scmd->scsi_done(scmd);
+   r = SUCCESS;
+   goto out;
+   }
+
/* for hidden raid components obtain the volume_handle */
handle = 0;
if (sas_device_priv_data->sas_target->flags &
@@ -3131,6 +3145,13 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
goto out;
}
 
+   if (ioc->remove_host) {
+   scmd->result = DID_NO_CONNECT << 16;
+   scmd->scsi_done(scmd);
+   r = SUCCESS;
+   goto out;
+   }
+
/* for hidden raid components obtain the volume_handle */
handle = 0;
if (sas_device_priv_data->sas_target->flags &
@@ -3179,6 +3200,13 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER 
*ioc, u16 handle,
ioc->name, scmd);
scsi_print_command(scmd);
 
+   if (ioc->remove_host) {
+   scmd->result = DID_NO_CONNECT << 16;
+   scmd->scsi_done(scmd);
+   r = FAILED;
+   goto out;
+   }
+
if (ioc->is_driver_loading) {
pr_info(MPT3SAS_FMT "Blocking the host reset\n",
ioc->name);
-- 
1.8.3.1



Re: [PATCH v2 1/4] scsi: scsi_dh_alua: allow I/O in target port unavailable and standby states

2017-07-11 Thread Mauricio Faria de Oliveira

On 07/11/2017 12:32 PM, Mauricio Faria de Oliveira wrote:

Also, it seems the Unavailable/Standby states would not be logged
without a recheck from alua_check_sense(), since the only callers
of alua_rtpg_queue() are alua_activate() and alua_check[_sense]()


Well, actually it does get logged if when activating/switching path
groups but shouldn't be the case with only a single path group.


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH v2 1/4] scsi: scsi_dh_alua: allow I/O in target port unavailable and standby states

2017-07-11 Thread Mauricio Faria de Oliveira

On 07/11/2017 06:18 AM, Hannes Reinecke wrote:

NACK.

The whole_point_  of having device handlers is to_avoid_  I/O errors
during booting.

And the ALUA checker is prepared to handle this situation properly.
The directio checker of course doesn't know about this, but then no-one
expected the directio checker to work with ALUA.


I lacked that more holistic understanding. Thanks for explaining.

Now, for the sake of logging/debugging...

Any problem with patches 2 and 4?

Also, it seems the Unavailable/Standby states would not be logged
without a recheck from alua_check_sense(), since the only callers
of alua_rtpg_queue() are alua_activate() and alua_check[_sense]()
[the call from alua_check_vpd() is only in the initialization path].

Isn't there a point in scheduling a recheck once those conditions
are found in alua_check_sense() to get them logged? - since valid
path checkers won't go through that function.

(and it occurred to me that the state-change check of patch 3 can
be done there, simpler.)

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH v2 0/4] scsi_dh_alua: fix stuck I/O after unavailable/standby states

2017-07-10 Thread Mauricio Faria de Oliveira

On 07/10/2017 07:47 PM, Mauricio Faria de Oliveira wrote:

This patchset addresses that problem, and adds a few improvements
to the logging of PG state changes.


Here are some kernel log snippets with the patchset, if that helps.

The 2 port groups temporarily gone into unavailable state, and
recovered to active states later on, when I/O has been restored.

The devices, port groups, and relative port IDs:
2 LUNs, 2 targets per LUN, 2 hosts (8 paths total).

[  294.482107] scsi 4:0:0:0: alua: device 
naa.60050764008100dac104 port group 1 rel port 881
[  294.642063] scsi 4:0:0:1: alua: device 
naa.60050764008100dac105 port group 0 rel port 881
[  294.812140] scsi 4:0:1:0: alua: device 
naa.60050764008100dac104 port group 0 rel port 81
[  295.002027] scsi 4:0:1:1: alua: device 
naa.60050764008100dac105 port group 1 rel port 81
[  296.952092] scsi 5:0:0:0: alua: device 
naa.60050764008100dac104 port group 1 rel port 881
[  297.162043] scsi 5:0:0:1: alua: device 
naa.60050764008100dac105 port group 0 rel port 881
[  297.312130] scsi 5:0:1:0: alua: device 
naa.60050764008100dac104 port group 0 rel port 81
[  297.512006] scsi 5:0:1:1: alua: device 
naa.60050764008100dac105 port group 1 rel port 81


First, mpatha goes down:

mpatha (360050764008100dac104) dm-6 IBM ,2145
size=40G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| |- 4:0:1:0 sde 8:64  active undef unknown
| `- 5:0:1:0 sdi 8:128 active undef unknown
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 4:0:0:0 sdc 8:32  active undef unknown
  `- 5:0:0:0 sdg 8:96  active undef unknown

[258929.940734] device-mapper: multipath: Failing path 8:64.
[258929.940751] device-mapper: multipath: Failing path 8:128.

[258929.940777] sd 4:0:0:0: alua: alua_activate(): port group 01, fn: 
pg_init_done [dm_multipath]()
[258929.940784] sd 5:0:0:0: alua: alua_activate(): port group 01, fn: 
pg_init_done [dm_multipath]()


[258929.941575] sd 5:0:1:0: alua: alua_check_sense(): Sense Key: 0x2, 
ASC: 0x4, ASCQ: 0xc
[258929.941578] sd 5:0:1:0: [sdi] tag#1 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE

[258929.941874] sd 5:0:1:0: [sdi] tag#1 Sense Key : Not Ready [current]
[258929.941928] sd 5:0:1:0: [sdi] tag#1 Add. Sense: Logical unit not 
accessible, target port in unavailable state


[258929.942096] sd 4:0:1:0: alua: alua_check_sense(): Sense Key: 0x2, 
ASC: 0x4, ASCQ: 0xc
[258929.942104] sd 4:0:1:0: [sde] tag#2 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE

[258929.942300] sd 4:0:1:0: [sde] tag#2 Sense Key : Not Ready [current]
[258929.942354] sd 4:0:1:0: [sde] tag#2 Add. Sense: Logical unit not 
accessible, target port in unavailable state


[258929.956347] sd 4:0:1:0: alua: port group 00 state U non-preferred 
supports tolusna

[258929.956389] sd 4:0:1:0: alua: port group 01 state U non-preferred

[258930.036355] sd 5:0:0:0: alua: alua_check_sense(): Sense Key: 0x2, 
ASC: 0x4, ASCQ: 0xc

[258930.036360] sd 5:0:0:0: alua: alua_rtpg_queue(): port group 01
[258930.036365] device-mapper: multipath: Failing path 8:96.

[258930.036444] sd 4:0:0:0: alua: alua_check_sense(): Sense Key: 0x2, 
ASC: 0x4, ASCQ: 0xc

[258930.036506] device-mapper: multipath: Failing path 8:32.

Then, mpathb goes down:

mpathb (360050764008100dac105) dm-7 IBM ,2145
size=40G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| |- 4:0:0:1 sdd 8:48  active undef unknown
| `- 5:0:0:1 sdh 8:112 active undef unknown
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 4:0:1:1 sdf 8:80  active undef unknown
  `- 5:0:1:1 sdj 8:144 active undef unknown


[258930.162558] sd 4:0:0:1: alua: alua_check_sense(): Sense Key: 0x2, 
ASC: 0x4, ASCQ: 0xc

[258930.162568] sd 4:0:0:1: alua: alua_rtpg_queue(): port group 00

[258930.162572] sd 4:0:0:1: [sdd] tag#9 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE

[258930.162731] sd 4:0:0:1: [sdd] tag#9 Sense Key : Not Ready [current]
[258930.162762] sd 4:0:0:1: [sdd] tag#9 Add. Sense: Logical unit not 
accessible, target port in unavailable state

[258930.162860] device-mapper: multipath: Failing path 8:48.

[258930.162979] sd 5:0:0:1: [sdh] tag#1 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE

[258930.162980] sd 5:0:0:1: [sdh] tag#1 Sense Key : Not Ready [current]
[258930.162982] sd 5:0:0:1: [sdh] tag#1 Add. Sense: Logical unit not 
accessible, target port in unavailable state

[258930.162992] device-mapper: multipath: Failing path 8:112.

[258930.163025] sd 4:0:1:1: alua: alua_activate(): port group 01, fn: 
pg_init_done [dm_multipath]()

[258930.163029] sd 4:0:1:1: alua: alua_rtpg_queue(): port group 01
[258930.163048] sd 5:0:1:1: alua: alua_activate(): port group 01, fn: 
pg_init_done [dm_multipath]()


[258930.176323] sd 4:0:0:1: alua: port group 00 state U non-preferred 
supports tolusna

[PATCH v2 2/4] scsi: scsi_dh_alua: print changes to RTPG state of all PGs

2017-07-10 Thread Mauricio Faria de Oliveira
Currently, alua_rtpg() can change the 'state' and 'preferred'
values for the current port group _and_ of other port groups.

However, it reports that _only_ for the current port group.

This might cause uncertainty and confusion when going through
the kernel logs for analyzing/debugging scsi_dh_alua behavior,
which is not helpful during support and development scenarios.

So, print of such changes for all port groups, when it occurs.

It's possible to distinguish between the messages printed for
the current and other PGs by the 'supports' (supported states)
part, which is only printed for the current PG.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>

---
v2:
- use lockdep_assert_held() instead of documenting locking conventions
  (Bart Van Assche <bart.vanass...@sandisk.com>)
- define two functions (with/without supported states information)
  (Bart Van Assche <bart.vanass...@sandisk.com>)
- simplify which device is used for printing the messages
  (use the evaluated scsi device and tell which PG the info is for)
- call the print functions from nearby places
  (right after modifying the PG data)

 drivers/scsi/device_handler/scsi_dh_alua.c | 63 +++---
 1 file changed, 49 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c 
b/drivers/scsi/device_handler/scsi_dh_alua.c
index a1cf3d6aa853..937341ddb767 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -529,6 +529,49 @@ static int alua_tur(struct scsi_device *sdev)
 }
 
 /*
+ * alua_rtpg_print - Print REPORT TARGET GROUP STATES information
+ *  (without supported states)
+ * @sdev: the device evaluated (source of information).
+ * @pg: the port group associated with the information.
+ */
+static void alua_rtpg_print(struct scsi_device *sdev,
+   struct alua_port_group *pg)
+{
+   lockdep_assert_held(>lock);
+
+   sdev_printk(KERN_INFO, sdev,
+   "%s: port group %02x state %c %s\n",
+   ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state),
+   pg->pref ? "preferred" : "non-preferred");
+}
+
+/*
+ * alua_rtpg_print_supported - Print REPORT TARGET GROUP STATES information
+ *(with supported states)
+ * @sdev: the device evaluated (source of information).
+ * @pg: the port group associated with the information.
+ * @supported_states: the supported states information.
+ */
+static void alua_rtpg_print_supported(struct scsi_device *sdev,
+ struct alua_port_group *pg,
+ int supported_states)
+{
+   lockdep_assert_held(>lock);
+
+   sdev_printk(KERN_INFO, sdev,
+   "%s: port group %02x state %c %s supports %c%c%c%c%c%c%c\n",
+   ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state),
+   pg->pref ? "preferred" : "non-preferred",
+   supported_states & TPGS_SUPPORT_TRANSITION? 'T' : 't',
+   supported_states & TPGS_SUPPORT_OFFLINE   ? 'O' : 'o',
+   supported_states & TPGS_SUPPORT_LBA_DEPENDENT ? 'L' : 'l',
+   supported_states & TPGS_SUPPORT_UNAVAILABLE   ? 'U' : 'u',
+   supported_states & TPGS_SUPPORT_STANDBY   ? 'S' : 's',
+   supported_states & TPGS_SUPPORT_NONOPTIMIZED  ? 'N' : 'n',
+   supported_states & TPGS_SUPPORT_OPTIMIZED ? 'A' : 'a');
+}
+
+/*
  * alua_rtpg - Evaluate REPORT TARGET GROUP STATES
  * @sdev: the device to be evaluated.
  *
@@ -540,7 +583,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
 {
struct scsi_sense_hdr sense_hdr;
struct alua_port_group *tmp_pg;
-   int len, k, off, valid_states = 0, bufflen = ALUA_RTPG_SIZE;
+   int len, k, off, bufflen = ALUA_RTPG_SIZE;
unsigned char *desc, *buff;
unsigned err, retval;
unsigned int tpg_desc_tbl_off;
@@ -674,9 +717,12 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
h->sdev->access_state = desc[0];
}
rcu_read_unlock();
+
+   if (tmp_pg == pg)
+   alua_rtpg_print_supported(sdev, 
tmp_pg, desc[1]);
+   else
+   alua_rtpg_print(sdev, tmp_pg);
}
-   if (tmp_pg == pg)
-   valid_states = desc[1];
spin_unlock_irqrestore(_pg->lock, f

[PATCH v2 1/4] scsi: scsi_dh_alua: allow I/O in target port unavailable and standby states

2017-07-10 Thread Mauricio Faria de Oliveira
According to SPC-4 (5.15.2.4.5 Unavailable state), the unavailable
state may (or may not) transition to other states (e.g., microcode
downloading or hardware error, which may be temporary or permanent).

But, scsi_dh_alua currently fails I/O requests early on once that
state occurs (in alua_prep_fn()) preventing path checkers in such
function path to actually check if I/O still fails or now works.

And that prevents a path activation (alua_activate()) which could
update the PG state if it eventually recovered to an active state,
thus resume I/O. (This is also the case with the standby state.)

This might cause device-mapper multipath to fail all paths to some
storage system that moves the controllers to the unavailable state
for firmware upgrades, and never recover regardless of the storage
system doing upgrades one controller at a time and get them online.

Then I/O requests are blocked indefinitely due to queue_if_no_path
but the underlying individual paths are fully operational, and can
be verified as such through other function paths (e.g., SG_IO):

# multipath -l
mpatha (360050764008100dac100) dm-0 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 1:0:1:0 sdf 8:80  failed undef running
| `- 2:0:1:0 sdn 8:208 failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 1:0:0:0 sdb 8:16  failed undef running
  `- 2:0:0:0 sdj 8:144 failed undef running

# strace -e read \
sg_dd blk_sgio=0 \
if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \
2>&1 | grep 512
read(3, 0x3fff7ba8, 512) = -1 EIO (Input/output error)

# strace -e ioctl \
sg_dd blk_sgio=1 \
if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \
2>&1 | grep 512
ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[10]=[28, 00, 00, 00,
00, 00, 00, 00, 01, 00], <...>) = 0

So, allow I/O to paths of PGs in unavailable/standby state, so path
checkers can actually check them.

Also, schedule a recheck when unavailable/standby state is detected
(in alua_check_sense()) to update pg->state, and quiet further SCSI
error messages (in alua_prep_fn()).

Once a path checker eventually detects a working/active state again,
the PG state is normally updated on path activation (alua_activate(),
as it schedules a recheck), thus I/O requests are no longer failed.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Reported-by: Naresh Bannoth <nbann...@in.ibm.com>

---
v2:
- also add support for standby state to alua_check_sense(), alua_prep_fn()
  (Bart Van Assche <bart.vanass...@sandisk.com>)

 drivers/scsi/device_handler/scsi_dh_alua.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c 
b/drivers/scsi/device_handler/scsi_dh_alua.c
index c01b47e5b55a..a1cf3d6aa853 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -431,6 +431,26 @@ static int alua_check_sense(struct scsi_device *sdev,
alua_check(sdev, false);
return NEEDS_RETRY;
}
+   if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0b) {
+   /*
+* LUN Not Accessible - target port in standby state.
+*
+* Do not retry, so failover to another target port 
occur.
+* Schedule a recheck to update state for other 
functions.
+*/
+   alua_check(sdev, true);
+   return SUCCESS;
+   }
+   if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0c) {
+   /*
+* LUN Not Accessible - target port in unavailable 
state.
+*
+* Do not retry, so failover to another target port 
occur.
+* Schedule a recheck to update state for other 
functions.
+*/
+   alua_check(sdev, true);
+   return SUCCESS;
+   }
break;
case UNIT_ATTENTION:
if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00) {
@@ -1057,6 +1077,8 @@ static void alua_check(struct scsi_device *sdev, bool 
force)
  *
  * Fail I/O to all paths not in state
  * active/optimized or active/non-optimized.
+ * Allow I/O to paths in state unavailable/standby
+ * so path checkers can actually check them.
  */
 static int alua_prep_fn(struct scsi_device *sdev, struct request *req)
 {
@@ -1072,6 +1094,9 @@ static int alua_prep_fn(struct scsi_device *sdev, struct 
request *req)
rcu_read_unlock();
if (state ==

[PATCH v2 3/4] scsi: scsi_dh_alua: do not print RTPG state if it remains unavailable/standby

2017-07-10 Thread Mauricio Faria de Oliveira
Path checkers will check paths of a port group in unavailable/standby
state more frequently (as they are 'failed') - possibly for a long or
indefinite period of time, and/or for a large number of paths.

That might flood the kernel log with scsi_dh_alua RTPG state messages,
due to the recheck scheduled in alua_check_sense() to update PG state.

So, do not to print such message if unavailable/standby state remains
(i.e., the PG did not transition to/from such states). All other cases
continue to be printed.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>

---
v2:
- changed v1's alua_state_remains() into alua_rtpg_print_check(),
  which includes in the states (not) to print for (deduplicates code).
- added support for Standby state too (requested in patch 1 of this set)
  (Bart Van Assche <bart.vanass...@sandisk.com>)
- remove superfluous parentheses in return statement.
  (Bart Van Assche <bart.vanass...@sandisk.com>)
- remove likely() hint
  (Bart Van Assche <bart.vanass...@sandisk.com>)

 drivers/scsi/device_handler/scsi_dh_alua.c | 35 ++
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c 
b/drivers/scsi/device_handler/scsi_dh_alua.c
index 937341ddb767..e0bf0827a980 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -572,6 +572,30 @@ static void alua_rtpg_print_supported(struct scsi_device 
*sdev,
 }
 
 /*
+ * alua_rtpg_print_check - Whether to print RTPG state information
+ *on particular state transitions.
+ * @old_state: the old state value.
+ * @new_state: the new state value.
+ */
+static bool alua_rtpg_print_check(int old_state, int new_state)
+{
+   /*
+* Do not print RTPG state information in case
+* state remains either unavailable or standby.
+*
+* Print it for everything else.
+*/
+
+   switch (old_state) {
+   case SCSI_ACCESS_STATE_UNAVAILABLE:
+   case SCSI_ACCESS_STATE_STANDBY:
+   return old_state != new_state;
+   default:
+   return 1;
+   }
+}
+
+/*
  * alua_rtpg - Evaluate REPORT TARGET GROUP STATES
  * @sdev: the device to be evaluated.
  *
@@ -706,6 +730,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
if ((tmp_pg == pg) ||
!(tmp_pg->flags & ALUA_PG_RUNNING)) {
struct alua_dh_data *h;
+   int tmp_pg_state_orig = tmp_pg->state;
 
tmp_pg->state = desc[0] & 0x0f;
tmp_pg->pref = desc[0] >> 7;
@@ -718,10 +743,12 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
}
rcu_read_unlock();
 
-   if (tmp_pg == pg)
-   alua_rtpg_print_supported(sdev, 
tmp_pg, desc[1]);
-   else
-   alua_rtpg_print(sdev, tmp_pg);
+   if 
(alua_rtpg_print_check(tmp_pg_state_orig, tmp_pg->state)) {
+   if (tmp_pg == pg)
+   
alua_rtpg_print_supported(sdev, tmp_pg, desc[1]);
+   else
+   alua_rtpg_print(sdev, 
tmp_pg);
+   }
}
spin_unlock_irqrestore(_pg->lock, flags);
}
-- 
1.8.3.1



[PATCH v2 4/4] scsi: scsi_dh_alua: add sdev_dbg() to track alua_rtpg_work()

2017-07-10 Thread Mauricio Faria de Oliveira
Insert sdev_dbg() calls in the function path which may queue
alua_rtpg_work() past initialization, for debugging purposes:
- alua_activate()
- alua_check_sense()
- alua_rtpg_queue()

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c 
b/drivers/scsi/device_handler/scsi_dh_alua.c
index e0bf0827a980..5bf0d55f6eb1 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -422,6 +422,10 @@ static char print_alua_state(unsigned char state)
 static int alua_check_sense(struct scsi_device *sdev,
struct scsi_sense_hdr *sense_hdr)
 {
+   sdev_dbg(sdev, "%s: %s(): Sense Key: 0x%x, ASC: 0x%x, ASCQ: 0x%x\n",
+ALUA_DH_NAME, __func__, sense_hdr->sense_key,
+sense_hdr->asc, sense_hdr->ascq);
+
switch (sense_hdr->sense_key) {
case NOT_READY:
if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) {
@@ -987,10 +991,13 @@ static bool alua_rtpg_queue(struct alua_port_group *pg,
spin_unlock_irqrestore(>lock, flags);
 
if (start_queue) {
+
if (queue_delayed_work(alua_wq, >rtpg_work,
-   msecs_to_jiffies(ALUA_RTPG_DELAY_MSECS)))
+   msecs_to_jiffies(ALUA_RTPG_DELAY_MSECS))) {
+   sdev_dbg(sdev, "%s: %s(): port group %02x\n",
+ALUA_DH_NAME, __func__, pg->group_id);
sdev = NULL;
-   else
+   } else
kref_put(>kref, release_port_group);
}
if (sdev)
@@ -1100,6 +1107,9 @@ static int alua_activate(struct scsi_device *sdev,
rcu_read_unlock();
mutex_unlock(>init_mutex);
 
+   sdev_dbg(sdev, "%s: %s(): port group %02x, fn: %pf()\n",
+ALUA_DH_NAME, __func__, pg->group_id, fn);
+
if (alua_rtpg_queue(pg, sdev, qdata, true))
fn = NULL;
else
-- 
1.8.3.1



[PATCH v2 0/4] scsi_dh_alua: fix stuck I/O after unavailable/standby states

2017-07-10 Thread Mauricio Faria de Oliveira
Currently, scsi_dh_alua fails I/O requests early on once ALUA state
unavailable/standby occur, which prevents path checkers to actually
check if I/O still fails or now works.

Then I/O requests are blocked indefinitely due to queue_if_no_path
but the underlying individual paths are fully operational, and can
be verified as such otherways (e.g., SG_IO).

This patchset addresses that problem, and adds a few improvements
to the logging of PG state changes.

Patch 1 fixes the problem.
Patch 2 makes sure that state changes for all PGs are logged.
Patch 3 makes sure that state no-changes for PGs in unavailable/standby
are not logged - only changes are.
Patch 4 adds few sdev_dbg() calls to track the path to alua_rtpg_work()

Tested on v4.12+ (commit b4b8cbf679c4).

Mauricio Faria de Oliveira (4):
  scsi: scsi_dh_alua: allow I/O in target port unavailable and standby
states
  scsi: scsi_dh_alua: print changes to RTPG state of all PGs
  scsi: scsi_dh_alua: do not print RTPG state if it remains
unavailable/standby
  scsi: scsi_dh_alua: add sdev_dbg() to track alua_rtpg_work()

 drivers/scsi/device_handler/scsi_dh_alua.c | 129 +
 1 file changed, 113 insertions(+), 16 deletions(-)

-- 
1.8.3.1



Re: [PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

2017-04-11 Thread Mauricio Faria de Oliveira

This is the PATCH v2.  Sorry for the wrong subject line.

On 04/11/2017 11:46 AM, Mauricio Faria de Oliveira wrote:

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Acked-by: Brian King <brk...@linux.vnet.ibm.com>
---
v2:
 - use the scsi_cmd local variable rather than ipr_cmd->scsi_cmd dereference.
 - add Acked-by: Brian King.



--
Mauricio Faria de Oliveira
IBM Linux Technology Center



[PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

2017-04-11 Thread Mauricio Faria de Oliveira
On a dual controller setup with multipath enabled, some MEDIUM ERRORs
caused both paths to be failed, thus I/O got queued/blocked since the
'queue_if_no_path' feature is enabled by default on IPR controllers.

This example disabled 'queue_if_no_path' so the I/O failure is seen at
the sg_dd program.  Notice that after the sg_dd test-case, both paths
are in 'failed' state, and both path/priority groups are in 'enabled'
state (not 'active') -- which would block I/O with 'queue_if_no_path'.

# sg_dd if=/dev/dm-2 bs=4096 count=1 dio=1 verbose=4 blk_sgio=0
<...>
read(unix): count=4096, res=-1
sg_dd: reading, skip=0 : Input/output error
<...>

# dmesg
[...] sd 2:2:16:0: [sds] FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
[...] sd 2:2:16:0: [sds] Sense Key : Medium Error [current]
[...] sd 2:2:16:0: [sds] Add. Sense: Unrecovered read error - recommend 
rewrite the data
[...] sd 2:2:16:0: [sds] CDB: Read(10) 28 00 00 00 00 00 00 00 20 00
[...] blk_update_request: I/O error, dev sds, sector 0
[...] device-mapper: multipath: Failing path 65:32.
<...>
[...] device-mapper: multipath: Failing path 65:224.

# multipath -l
1IBM_IPR-0_59C2AE001F80 dm-2 IBM ,IPR-0   59C2AE00
size=5.2T features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 2:2:16:0 sds  65:32  failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 1:2:7:0  sdae 65:224 failed undef running

This is not the desired behavior. The dm-multipath explicitly checks
for the MEDIUM ERROR case (and a few others) so not to fail the path
(e.g., I/O to other sectors could potentially happen without problems).
See dm-mpath.c :: do_end_io_bio() -> noretry_error() !->! fail_path().

The problem trace is:

1) ipr_scsi_done()  // SENSE KEY/CHECK CONDITION detected, go to..
2) ipr_erp_start()  // ipr_is_gscsi() and masked_ioasc OK, go to..
3) ipr_gen_sense()  // masked_ioasc is IPR_IOASC_MED_DO_NOT_REALLOC,
// so set DID_PASSTHROUGH.

4) scsi_decide_disposition()  // check for DID_PASSTHROUGH and return
  // early on, faking a DID_OK.. *instead*
  // of reaching scsi_check_sense().

  // Had it reached the latter, that would
  // set host_byte to DID_MEDIUM_ERROR.

5) scsi_finish_command()
6) scsi_io_completion()
7) __scsi_error_from_host_byte()  // That would be converted to -ENODATA
<...>
8) dm_softirq_done()
9) multipath_end_io()
10) do_end_io()
11) noretry_error()  // And that is checked in dm-mpath :: noretry_error()
 // which would cause fail_path() not to be called.

With this patch applied, the I/O is failed but the paths are not.  This
multipath device continues accepting more I/O requests without blocking.
(and notice the different host byte/driver byte handling per SCSI layer).

# dmesg
[...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 
driverbyte=DRIVER_OK
[...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
[...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current]
[...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend 
rewrite the data
[...] blk_update_request: critical medium error, dev sdaf, sector 0
[...] blk_update_request: critical medium error, dev dm-6, sector 0
[...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 
driverbyte=DRIVER_OK
[...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 10 00
[...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current]
[...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend 
rewrite the data
[...] blk_update_request: critical medium error, dev sdaf, sector 0
[...] blk_update_request: critical medium error, dev dm-6, sector 0
[...] Buffer I/O error on dev dm-6, logical block 0, async page read

# multipath -l 1IBM_IPR-0_59C2AE001F80
1IBM_IPR-0_59C2AE001F80 dm-6 IBM ,IPR-0   59C2AE00
size=5.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- 2:2:7:0  sdaf 65:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 1:2:7:0  sdh  8:112  active undef running

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Acked-by: Brian King <brk...@linux.vnet.ibm.com>
---
v2:
 - use the scsi_cmd local variable rather than ipr_cmd->scsi_cmd dereference.
 - add Acked-by: Brian King.

 drivers/scsi/ipr.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index b29afafc2885..5d5e272fd815 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -6293,7 +6293,12 @@ static void ipr_erp_start(struct ipr_ioa_cfg *ioa_cfg,
break;

Re: [PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

2017-04-11 Thread Mauricio Faria de Oliveira

On 04/11/2017 11:34 AM, Brian King wrote:

Looks good to me. Thanks for the detailed analysis. One minor nit below
and you can add:

Acked-by: Brian King <brk...@linux.vnet.ibm.com>


Cool, thanks. I'll submit a v2.


Since we already have a scsi_cmd local, you don't need to go back to the ipr_cmd
to get the pointer, so could be:


Thanks for catching that oversight.


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



[PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

2017-04-10 Thread Mauricio Faria de Oliveira
On a dual controller setup with multipath enabled, some MEDIUM ERRORs
caused both paths to be failed, thus I/O got queued/blocked since the
'queue_if_no_path' feature is enabled by default on IPR controllers.

This example disabled 'queue_if_no_path' so the I/O failure is seen at
the sg_dd program.  Notice that after the sg_dd test-case, both paths
are in 'failed' state, and both path/priority groups are in 'enabled'
state (not 'active') -- which would block I/O with 'queue_if_no_path'.

# sg_dd if=/dev/dm-2 bs=4096 count=1 dio=1 verbose=4 blk_sgio=0
<...>
read(unix): count=4096, res=-1
sg_dd: reading, skip=0 : Input/output error
<...>

# dmesg
[...] sd 2:2:16:0: [sds] FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
[...] sd 2:2:16:0: [sds] Sense Key : Medium Error [current]
[...] sd 2:2:16:0: [sds] Add. Sense: Unrecovered read error - recommend 
rewrite the data
[...] sd 2:2:16:0: [sds] CDB: Read(10) 28 00 00 00 00 00 00 00 20 00
[...] blk_update_request: I/O error, dev sds, sector 0
[...] device-mapper: multipath: Failing path 65:32.
<...>
[...] device-mapper: multipath: Failing path 65:224.

# multipath -l
1IBM_IPR-0_59C2AE001F80 dm-2 IBM ,IPR-0   59C2AE00
size=5.2T features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 2:2:16:0 sds  65:32  failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 1:2:7:0  sdae 65:224 failed undef running

This is not the desired behavior. The dm-multipath explicitly checks
for the MEDIUM ERROR case (and a few others) so not to fail the path
(e.g., I/O to other sectors could potentially happen without problems).
See dm-mpath.c :: do_end_io_bio() -> noretry_error() !->! fail_path().

The problem trace is:

1) ipr_scsi_done()  // SENSE KEY/CHECK CONDITION detected, go to..
2) ipr_erp_start()  // ipr_is_gscsi() and masked_ioasc OK, go to..
3) ipr_gen_sense()  // masked_ioasc is IPR_IOASC_MED_DO_NOT_REALLOC,
// so set DID_PASSTHROUGH.

4) scsi_decide_disposition()  // check for DID_PASSTHROUGH and return
  // early on, faking a DID_OK.. *instead*
  // of reaching scsi_check_sense().

  // Had it reached the latter, that would
  // set host_byte to DID_MEDIUM_ERROR.

5) scsi_finish_command()
6) scsi_io_completion()
7) __scsi_error_from_host_byte()  // That would be converted to -ENODATA
<...>
8) dm_softirq_done()
9) multipath_end_io()
10) do_end_io()
11) noretry_error()  // And that is checked in dm-mpath :: noretry_error()
 // which would cause fail_path() not to be called.

With this patch applied, the I/O is failed but the paths are not.  This
multipath device continues accepting more I/O requests without blocking.
(and notice the different host byte/driver byte handling per SCSI layer).

# dmesg
[...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 
driverbyte=DRIVER_OK
[...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
[...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current]
[...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend 
rewrite the data
[...] blk_update_request: critical medium error, dev sdaf, sector 0
[...] blk_update_request: critical medium error, dev dm-6, sector 0
[...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 
driverbyte=DRIVER_OK
[...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 10 00
[...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current]
[...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend 
rewrite the data
[...] blk_update_request: critical medium error, dev sdaf, sector 0
[...] blk_update_request: critical medium error, dev dm-6, sector 0
[...] Buffer I/O error on dev dm-6, logical block 0, async page read

# multipath -l 1IBM_IPR-0_59C2AE001F80
1IBM_IPR-0_59C2AE001F80 dm-6 IBM ,IPR-0   59C2AE00
size=5.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- 2:2:7:0  sdaf 65:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 1:2:7:0  sdh  8:112  active undef running

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---

P.S.:  The switch-case checking this condition predates linux.git, so I'd
   like to ask Brian for a review too, as I imagine he probably might
   have worked with this back then (if memory may serve one this well :-)

 drivers/scsi/ipr.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index b29afafc2885..1012674d9dc5 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -6293,7 +6293,12 @@ static void ipr_e

Re: [PATCH 0/4] scsi: scsi_dh_alua: handle target port unavailable state

2017-04-10 Thread Mauricio Faria de Oliveira

On 04/10/2017 10:17 PM, Mauricio Faria de Oliveira wrote:

For documentation purposes, I'll reply to this cover letter with the analysis
of such cases of this problem, and the accompanying messages from kernel logs.


Here it goes, for anyone interested.
Scenario: 4 LUNs, 2 target port groups (PGs), 2 paths per PG.

mpatha (360050764008100dac100) dm-0 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler' 
hwhandler='1 alua' wp=rw

|-+- policy='service-time 0' prio=0 status=active
| |- 1:0:1:0 sdf 8:80  active undef running
| `- 2:0:1:0 sdn 8:208 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 1:0:0:0 sdb 8:16  active undef running
  `- 2:0:0:0 sdj 8:144 active undef running

This LUN's active port group becomes Unavailable.
The ongoing I/O requests sense it via SCSI command errors.

	Apr  6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:0: [sdf] tag#0 Add. 
Sense: Logical unit not accessible, target port in unavailable state
	Apr  6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:1:0: [sdn] tag#0 Add. 
Sense: Logical unit not accessible, target port in unavailable state


Accordingly, dm-mpath fails both paths:

	Apr  6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing 
path 8:80.
	Apr  6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing 
path 8:208.


The SCSI command error has gone through alua_check_sense().
It scheduled an alua_rtpg(), which confirms its port group
(for 1:0:1:0) and the other (for 2:0:0:0) are unavailable.

	Apr  6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 
state U non-preferred
	Apr  6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 
state U non-preferred supports tolusna


Failing over, dm-mpath switched path groups, to the other/enabled port 
group.


Notice that further SCSI command errors _happen_, but are not _reported_,
as alua_rtpg() has set pg->state = unavailable before more I/O requests
eventually reached alua_prep_fn(), which sets them with RQF_QUIET.

Accordingly, dm-mpath fails both paths:

	Apr  6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing 
path 8:144.
	Apr  6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing 
path 8:16.


The path checkers are running, and constantly receiving senses of 
unavailable state
(seen through debug messages), and alua_rtpg() rechecks are scheduled, 
but no

repeated messages for 'state U' are printed, as it has not changed at all.

After a few minutes, both port groups become active 
(optimized/non-optimized),

and dm-mpath reinstate all paths:

	Apr  6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: 
Reinstating path 8:16.
	Apr  6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: 
Reinstating path 8:80.
	Apr  6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: 
Reinstating path 8:144.
	Apr  6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: 
Reinstating path 8:208.


The dm-mpath reinstate calls alua_activate() -> alua_rtpg() for both 
paths in the active port group.
The first alua_rtpg() worker is still running when the second 
alua_rtpg() is scheduled,
so pg->rtpg_sdev does not change, and it's reported as 1:0:1:0 twice 
(but called for 1:0:1:0 and 2:0:1:0).


	Apr  6 07:21:01 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 
state N non-preferred
	Apr  6 07:21:01 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 
state A non-preferred supports tolusna
	Apr  6 07:21:01 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 
state N non-preferred
	Apr  6 07:21:01 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 
state A non-preferred supports tolusna




...


mpathb (360050764008100dac101) dm-1 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler' 
hwhandler='1 alua' wp=rw

|-+- policy='service-time 0' prio=0 status=active
| |- 1:0:0:1 sdc 8:32  active undef running
| `- 2:0:0:1 sdk 8:160 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 1:0:1:1 sdg 8:96  active undef running
  `- 2:0:1:1 sdo 8:224 active undef running


Similarly,

	Apr  5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:1: [sdk] tag#2 Add. 
Sense: Logical unit not accessible, target port in unavailable state
	Apr  5 19:59:52 ltcalpine-lp2 kernel: sd 1:0:0:1: [sdc] tag#5 Add. 
Sense: Logical unit not accessible, target port in unavailable state


	Apr  5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing 
path 8:160.
	Apr  5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing 
path 8:32.


In this case alua_rtpg() does not print messages for the 'other' port group
(without 'supports'), because the other PG worker was running at the time.
There was an alua_rtpg() scheduled due to alua_check_sense() for 2:0:0:1,
and another for alua_activate() of 1:0:1:1 (from the 'enabled' path group).

	Apr  5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:1: alua: port group 00 
state U non-preferred supports tolusna
	

[PATCH 3/4] scsi: scsi_dh_alua: print changes to RTPG state of other PGs too

2017-04-10 Thread Mauricio Faria de Oliveira
Currently, alua_rtpg() can change the 'state' and 'preferred'
values for the current port group _and_ of other port groups
present in the response buffer/descriptors.

However, it reports such changes _only_ for the current port
group (i.e., only for 'pg' but not for 'tmp_pg').

This might cause uncertainty and confusion when going through
the kernel logs for analyzing/debugging scsi_dh_alua behavior,
which is not helpful during support and development scenarios.

So, print such changes for other port groups than the current
one.

This requires finding a scsi_device to call sdev_printk() with
for that other port group. Using 'tmp_pg->rtpg_sdev' is not an
option because in that 'if' conditional the 'tmp_pg' is not in
the ALUA_PG_RUNNING state, so that pointer may be NULL. So the
for-loop over the tmp->pg device_handler structures is used to
find a valid scsi_device that is associated to this port group.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c 
b/drivers/scsi/device_handler/scsi_dh_alua.c
index 0d3508f2e285..c2c9173fd883 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -688,6 +688,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
if ((tmp_pg == pg) ||
!(tmp_pg->flags & ALUA_PG_RUNNING)) {
struct alua_dh_data *h;
+   struct scsi_device *tmp_pg_sdev = NULL;
 
tmp_pg->state = desc[0] & 0x0f;
tmp_pg->pref = desc[0] >> 7;
@@ -697,6 +698,17 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
/* h->sdev should always be 
valid */
BUG_ON(!h->sdev);
h->sdev->access_state = desc[0];
+
+   /*
+* If tmp_pg is not the running 
pg, and its worker
+* is not running, 
tmp_pg->rtpg_sdev might be NULL.
+* Use another/one associated 
scsi_device to print
+* its RTPG information.
+*/
+   if ((tmp_pg != pg) && 
!tmp_pg_sdev) {
+   tmp_pg_sdev = h->sdev;
+   
alua_rtpg_print(tmp_pg_sdev, tmp_pg, NULL);
+   }
}
rcu_read_unlock();
}
-- 
1.8.3.1



[PATCH 2/4] scsi: scsi_dh_alua: create alua_rtpg_print() for alua_rtpg() sdev_printk

2017-04-10 Thread Mauricio Faria de Oliveira
Factor out the sdev_printk() statement with the RTPG information
in alua_rtpg() into a new function, alua_rtpg_print(), that will
also be used in the following patch.

The only functional difference is that the 'valid_states' value
is now referenced via a pointer, and can be NULL (optional), in
which case the 'valid_states' information is not printed.  That
is for the following patch too.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 43 ++
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c 
b/drivers/scsi/device_handler/scsi_dh_alua.c
index 5e5a33cac951..0d3508f2e285 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -523,6 +523,37 @@ static int alua_tur(struct scsi_device *sdev)
 }
 
 /*
+ * alua_rtpg_print - Print REPORT TARGET GROUP STATES information
+ * @sdev: the device evaluated (or associated with this port group).
+ * @pg: the port group the information is associated with.
+ * @valid_states: pointer to valid_states value.
+ *(optional; e.g., obtained via another port group)
+ *
+ * Must be called with pg->lock held.
+ */
+static void alua_rtpg_print(struct scsi_device *sdev, struct alua_port_group 
*pg,
+   int *valid_states)
+{
+   if (valid_states)
+   sdev_printk(KERN_INFO, sdev,
+   "%s: port group %02x state %c %s supports %c%c%c%c%c%c%c\n",
+   ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state),
+   pg->pref ? "preferred" : "non-preferred",
+   (*valid_states) & TPGS_SUPPORT_TRANSITION ? 'T' : 't',
+   (*valid_states) & TPGS_SUPPORT_OFFLINE ? 'O' : 'o',
+   (*valid_states) & TPGS_SUPPORT_LBA_DEPENDENT ? 'L' : 'l',
+   (*valid_states) & TPGS_SUPPORT_UNAVAILABLE ? 'U' : 'u',
+   (*valid_states) & TPGS_SUPPORT_STANDBY ? 'S' : 's',
+   (*valid_states) & TPGS_SUPPORT_NONOPTIMIZED ? 'N' : 'n',
+   (*valid_states) & TPGS_SUPPORT_OPTIMIZED  ?  'A' : 'a');
+   else
+   sdev_printk(KERN_INFO, sdev,
+   "%s: port group %02x state %c %s\n",
+   ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state),
+   pg->pref ? "preferred" : "non-preferred");
+}
+
+/*
  * alua_rtpg - Evaluate REPORT TARGET GROUP STATES
  * @sdev: the device to be evaluated.
  *
@@ -679,17 +710,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
}
 
spin_lock_irqsave(>lock, flags);
-   sdev_printk(KERN_INFO, sdev,
-   "%s: port group %02x state %c %s supports %c%c%c%c%c%c%c\n",
-   ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state),
-   pg->pref ? "preferred" : "non-preferred",
-   valid_states_SUPPORT_TRANSITION?'T':'t',
-   valid_states_SUPPORT_OFFLINE?'O':'o',
-   valid_states_SUPPORT_LBA_DEPENDENT?'L':'l',
-   valid_states_SUPPORT_UNAVAILABLE?'U':'u',
-   valid_states_SUPPORT_STANDBY?'S':'s',
-   valid_states_SUPPORT_NONOPTIMIZED?'N':'n',
-   valid_states_SUPPORT_OPTIMIZED?'A':'a');
+   alua_rtpg_print(sdev, pg, _states);
 
switch (pg->state) {
case SCSI_ACCESS_STATE_TRANSITIONING:
-- 
1.8.3.1



[PATCH 0/4] scsi: scsi_dh_alua: handle target port unavailable state

2017-04-10 Thread Mauricio Faria de Oliveira
This patch series resolves a problem in which all paths of a multipath device
became _permanently_ failed after a storage system had moved both controllers
into a _temporarily_ unavailable state (that is SCSI_ACCESS_STATE_UNAVAILABLE).

This happened because once scsi_dh_alua had set the 'pg->state' to that value,
any IO coming to that PG via alua_prep_fn() would be immediately failed there.

It was possible to confirm that IO coming to that PG by another function path
(e.g., SG_IO) would perform normally once that PG's respective storage system
controller had transitioned back to an active state.

- Patch 1 essentially resolves that problem by allowing IO requests coming in
  the SCSI_ACCESS_STATE_UNAVAILABLE to actually proceed in alua_prep_fn(). It
  also schedules a recheck in alua_check_sense() to update pg->state properly.
  The problem/debug test-case is included in its commit message for reference.

- Patch 2 and Patch 3 address uncertainty & potentially incorrect assumptions
  when trying to reconcile the alua: RTPG information in the kernel logs with
  the actual port groups state at a given point in time and to multipath/path
  checkers status/failed/reinstated messages, since scsi_dh_alua could update
  the PG state for the 'other' PG (i.e., not the PG by which the RTPG request
  was sent to) but only present an updated state message for the 'current' PG.

- Patch 4 silences the scsi_dh_alua messages about RTPG state/information for
  the unavailable state if it is no news (i.e., not a transition to/out-of it),
  only keeping the first and (potentially) last message (when it is some news).
  That's because during the period in which the unavailable state is in place,
  the path checkers will naturally have to go through alua_check_sense() path,
  which schedules a recheck and thus alua_rtpg() goes through the sdev_printk.

This patch series has been tested with the 4.11-rc4 kernel.

For documentation purposes, I'll reply to this cover letter with the analysis
of such cases of this problem, and the accompanying messages from kernel logs.

Mauricio Faria de Oliveira (4):
  scsi: scsi_dh_alua: allow I/O in the target port unavailable state
  scsi: scsi_dh_alua: create alua_rtpg_print() for alua_rtpg()
sdev_printk
  scsi: scsi_dh_alua: print changes to RTPG state of other PGs too
  scsi: scsi_dh_alua: do not print target port group state if it remains
unavailable

 drivers/scsi/device_handler/scsi_dh_alua.c | 99 ++
 1 file changed, 88 insertions(+), 11 deletions(-)

-- 
1.8.3.1



[PATCH 1/4] scsi: scsi_dh_alua: allow I/O in the target port unavailable state

2017-04-10 Thread Mauricio Faria de Oliveira
According to SPC-4 (5.15.2.4.5 Unavailable state), the unavailable
state may (or may not) transition to other states (e.g., microcode
downloading or hardware error, which may be temporary or permanent
conditions, respectively).

But, scsi_dh_alua currently fails the I/O requests early once that
state is established (in alua_prep_fn()), which provides no chance
for path checkers going through that function path to really check
whether the path actually still fails I/O requests or recovered to
an active state.

This might cause device-mapper multipath to fail all paths to some
storage system that moves the controllers to the unavailable state
for firmware upgrades, and never recover regardless of the storage
system doing upgrades one controller at a time and get them online.

Then I/O requests are blocked indefinitely due to queue_if_no_path
but the underlying individual paths are fully operational, and can
be verified as such through other function paths (e.g., SG_IO):

# multipath -l
mpatha (360050764008100dac100) dm-0 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 1:0:1:0 sdf 8:80  failed undef running
| `- 2:0:1:0 sdn 8:208 failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 1:0:0:0 sdb 8:16  failed undef running
  `- 2:0:0:0 sdj 8:144 failed undef running

# strace -e read \
sg_dd if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \
2>&1 | grep 512
read(3, 0x3fff7ba8, 512) = -1 EIO (Input/output error)

# strace -e ioctl \
sg_dd if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \
blk_sgio=1 \
2>&1 | grep 512
ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[10]=[28, 00, 00, 00,
00, 00, 00, 00, 01, 00], <...>) = 0

So, allow I/O to target port (groups) in the unavailable state, so the
path checkers can actually check them, and schedule a recheck whenever
the unavailable state is detected so pg->state can be updated properly
(and further SCSI IO error messages then silenced through alua_prep_fn()).

Once a path checker eventually detects an active state again, the port
group state will be updated by the path activation call, alua_activate(),
as it schedules an alua_rtpg() check.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Reported-by: Naresh Bannoth <nbann...@in.ibm.com>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c 
b/drivers/scsi/device_handler/scsi_dh_alua.c
index c01b47e5b55a..5e5a33cac951 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -431,6 +431,20 @@ static int alua_check_sense(struct scsi_device *sdev,
alua_check(sdev, false);
return NEEDS_RETRY;
}
+   if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0c) {
+   /*
+* LUN Not Accessible - target port in unavailable 
state.
+*
+* It may (not) be possible to transition to other 
states;
+* the transition might take a while or not happen at 
all,
+* depending on the storage system model, error type, 
etc.
+*
+* Do not retry, so failover to another target port 
occur.
+* Schedule a recheck to update state for other 
functions.
+*/
+   alua_check(sdev, true);
+   return SUCCESS;
+   }
break;
case UNIT_ATTENTION:
if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00) {
@@ -1057,6 +1071,8 @@ static void alua_check(struct scsi_device *sdev, bool 
force)
  *
  * Fail I/O to all paths not in state
  * active/optimized or active/non-optimized.
+ * Allow I/O to all paths in state unavailable
+ * so path checkers can actually check them.
  */
 static int alua_prep_fn(struct scsi_device *sdev, struct request *req)
 {
@@ -1072,6 +1088,8 @@ static int alua_prep_fn(struct scsi_device *sdev, struct 
request *req)
rcu_read_unlock();
if (state == SCSI_ACCESS_STATE_TRANSITIONING)
ret = BLKPREP_DEFER;
+   else if (state == SCSI_ACCESS_STATE_UNAVAILABLE)
+   req->rq_flags |= RQF_QUIET;
else if (state != SCSI_ACCESS_STATE_OPTIMAL &&
 state != SCSI_ACCESS_STATE_ACTIVE &&
 state != SCSI_ACCESS_STATE_LBA) {
-- 
1.8.3.1



[PATCH 4/4] scsi: scsi_dh_alua: do not print target port group state if it remains unavailable

2017-04-10 Thread Mauricio Faria de Oliveira
Path checkers will periodically check all paths to a target port group
in unavailable state more often (as they are 'failed'), possibly for a
long or indefinite period of time, or for a large number of paths.

That might end up flooding the kernel log with the scsi_dh_alua target
port group state messages, which are triggered by the rtpg recheck
scheduled in alua_check_sense().

So, modify alua_rtpg() not to print such message when that unavailable
state remains (i.e., the target port group state did not transition to
or from the unavailable state). All other cases continue to be printed.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c 
b/drivers/scsi/device_handler/scsi_dh_alua.c
index c2c9173fd883..8025e024c011 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -554,6 +554,17 @@ static void alua_rtpg_print(struct scsi_device *sdev, 
struct alua_port_group *pg
 }
 
 /*
+ * alua_state_remains - Whether a RTPG state remains the same across 2 values.
+ * @state: the state value to check for.
+ * @old_state: the old state value.
+ * @new_state: the new state value.
+ */
+static bool alua_state_remains(int state, int old_state, int new_state)
+{
+   return ((old_state == state) && (new_state == state));
+}
+
+/*
  * alua_rtpg - Evaluate REPORT TARGET GROUP STATES
  * @sdev: the device to be evaluated.
  *
@@ -571,6 +582,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
unsigned int tpg_desc_tbl_off;
unsigned char orig_transition_tmo;
unsigned long flags;
+   int orig_state;
 
if (!pg->expiry) {
unsigned long transition_tmo = ALUA_FAILOVER_TIMEOUT * HZ;
@@ -656,6 +668,8 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
goto retry;
}
 
+   orig_state = pg->state;
+
orig_transition_tmo = pg->transition_tmo;
if ((buff[4] & RTPG_FMT_MASK) == RTPG_FMT_EXT_HDR && buff[5] != 0)
pg->transition_tmo = buff[5];
@@ -689,6 +703,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
!(tmp_pg->flags & ALUA_PG_RUNNING)) {
struct alua_dh_data *h;
struct scsi_device *tmp_pg_sdev = NULL;
+   int tmp_pg_orig_state = tmp_pg->state;
 
tmp_pg->state = desc[0] & 0x0f;
tmp_pg->pref = desc[0] >> 7;
@@ -704,8 +719,11 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
 * is not running, 
tmp_pg->rtpg_sdev might be NULL.
 * Use another/one associated 
scsi_device to print
 * its RTPG information.
+* Don't print it if state 
remains 'unavailable'.
 */
-   if ((tmp_pg != pg) && 
!tmp_pg_sdev) {
+   if ((tmp_pg != pg) && 
!tmp_pg_sdev &&
+   
!alua_state_remains(SCSI_ACCESS_STATE_UNAVAILABLE,
+   
tmp_pg_orig_state, tmp_pg->state)) {
tmp_pg_sdev = h->sdev;

alua_rtpg_print(tmp_pg_sdev, tmp_pg, NULL);
}
@@ -722,7 +740,11 @@ static int alua_rtpg(struct scsi_device *sdev, struct 
alua_port_group *pg)
}
 
spin_lock_irqsave(>lock, flags);
-   alua_rtpg_print(sdev, pg, _states);
+
+   /* Print RTPG information (except if state remains 'unavailable'). */
+   if (likely(!alua_state_remains(SCSI_ACCESS_STATE_UNAVAILABLE,
+   orig_state, pg->state)))
+   alua_rtpg_print(sdev, pg, _states);
 
switch (pg->state) {
case SCSI_ACCESS_STATE_TRANSITIONING:
-- 
1.8.3.1



Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Mauricio Faria de Oliveira

On 04/05/2017 01:23 PM, Song Liu wrote:

Reviewed-by: Song Liu <songliubrav...@fb.com>


Thanks for reviewing, Song Liu.

It's good to know this patch doesn't break anything for you.

cheers,


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



[PATCH v2] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Mauricio Faria de Oliveira
The commit 08024885a2a3 ("ses: Add power_status to SES device slot")
introduced the 'power_status' attribute to enclosure components and
the associated callbacks.

There are 2 callbacks available to get the power status of a device:
1) ses_get_power_status() for 'struct enclosure_component_callbacks'
2) get_component_power_status() for the sysfs device attribute
(these are available for kernel-space and user-space, respectively.)

However, despite both methods being available to get power status
on demand, that commit also introduced a call to get power status
in ses_enclosure_data_process().

This dramatically increased the total probe time for SCSI devices
on larger configurations, because ses_enclosure_data_process() is
called several times during the SCSI devices probe and loops over
the component devices (but that is another problem, another patch).

That results in a tremendous continuous hammering of SCSI Receive
Diagnostics commands to the enclosure-services device, which does
delay the total probe time for the SCSI devices __significantly__:

  Originally, ~34 minutes on a system attached to ~170 disks:

[ 9214.490703] mpt3sas version 13.100.00.00 loaded
...
[11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0),
   ordered(0), scsi_level(6), cmd_que(1)

  With this patch, it decreased to ~2.5 minutes -- a 13.6x faster

[ 1002.992533] mpt3sas version 13.100.00.00 loaded
...
[ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0),
   ordered(0), scsi_level(6), cmd_que(1)

Back to the commit discussion.. on the ses_get_power_status() call
introduced in ses_enclosure_data_process(): impact of removing it.

That may possibly be in place to initialize the power status value
on device probe.  However, those 2 functions available to retrieve
that value _do_ automatically refresh/update it.  So the potential
benefit would be a direct access of the 'power_status' field which
does not use the callbacks...

But the only reader of 'struct enclosure_component::power_status'
is the get_component_power_status() callback for sysfs attribute,
and it _does_ check for and call the .get_power_status callback,
(which indeed is defined and implemented by that commit), so the
power status value is, again, automatically updated.

So, the remaining potential for a direct/non-callback access to
the power_status attribute would be out-of-tree modules -- well,
for those, if they are for whatever reason interested in values
that are set during device probe and not up-to-date by the time
they need it.. well, that would be curious.

Well, to handle that more properly, set the initial power state
value to '-1' (i.e., uninitialized) instead of '1' (power 'on'),
and check for it in that callback which may do an direct access
to the field value _if_ a callback function is not defined.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot")
---

v2:
 - remove module parameter.
 - better return values for -1/uninitalized power_status.
   (thanks: Dan Williams <dan.j.willi...@intel.com>)

 drivers/misc/enclosure.c | 7 ++-
 drivers/scsi/ses.c   | 1 -
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed7146e9b..d3fe3ea902d4 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -148,7 +148,7 @@ struct enclosure_device *
for (i = 0; i < components; i++) {
edev->component[i].number = -1;
edev->component[i].slot = -1;
-   edev->component[i].power_status = 1;
+   edev->component[i].power_status = -1;
}
 
mutex_lock(_list_lock);
@@ -594,6 +594,11 @@ static ssize_t get_component_power_status(struct device 
*cdev,
 
if (edev->cb->get_power_status)
edev->cb->get_power_status(edev, ecomp);
+
+   /* If still uninitialized, the callback failed or does not exist. */
+   if (ecomp->power_status == -1)
+   return (edev->cb->get_power_status) ? -EIO : -ENOTTY;
+
return snprintf(buf, 40, "%s\n", ecomp->power_status ? "on" : "off");
 }
 
diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
index 50adabbb5808..f1cdf32d7514 100644
--- a/drivers/scsi/ses.c
+++ b/drivers/scsi/ses.c
@@ -548,7 +548,6 @@ static void ses_enclosure_data_process(struct 
enclosure_device *edev,
ecomp = >component[components++];
 
if (!IS_ERR(ecomp)) {
-   ses_get_power_status(edev, ecomp);
if (addl_desc_ptr)
ses_process_descriptor(
ecomp,
-- 
1.8.3.1



Re: [PATCH] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Mauricio Faria de Oliveira

On 04/05/2017 11:41 AM, Dan Williams wrote:

On Wed, Apr 5, 2017 at 6:13 AM, Mauricio Faria de Oliveira
<mauri...@linux.vnet.ibm.com> wrote:



1) imagine .get_power_status couldn't update the 'power_status' field
   (it's a bit unlikely with the in-tree ses driver, but in the case
that ses_get_page2_descriptor() returns NULL, ses_get_power_status()
doesn't update 'power_status').

Ok, in that case we should probably return -EIO, if get_power_status
is non-NULL, but the power_status is still -1.


2) an out-of-tree module employs another enclosure_component_callbacks
   structure, which doesn't implement a .get_power_status hook.

If get_power_status is null and power_status is -1 I think we should
return -ENOTTY to indicate the failure.


3) or it does, but that failed, and didn't update 'power_status' (e.g.,
   like case 1)

I think we're back -EIO for this.


Absolutely. I see those are better values to convey the failure/reason.



However, for an in-tree usage, it's clear that just dropping that line
seemed to be sufficient -- the rest in the patch is just consideration
for whoever might be using it in out-of-tree ways.

(and if such consideration is deemed not to be required in this case,
 I'd be fine with dropping the related changes and making this patch
 a one-line. :- )

Let's not carry module parameters that only theoretically help an out
of tree module. If such a driver exists its owner can just holler if
this policy change breaks their usage, and then we can have a
discussion about why their driver isn't upstream already.


Okay, got it.
For the record, the patch author has been in To:/Cc:, for such cases.

I'll submit a v2.

Thanks,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [lpfc 05/19] Fix driver unload/reload operation.

2017-04-05 Thread Mauricio Faria de Oliveira

Hi Dick,

If you don't mind me being pedantic, I've noticed a couple of things.

First, is your different email addresses in from/signed-off-by lines:

On 04/04/2017 02:16 PM, Dick Kennedy wrote:

From: Dick Kennedy <rkenn...@lvnvda1400.lvn.broadcom.net>

[...]

Signed-off-by: Dick Kennedy <dick.kenn...@broadcom.com>


Second, is there are several indentation/line-break changes in this
'fix' patch too, of which at least one seems inconsistent in the code
(see below) -- that would be better split off to another patch IMHO.

Maybe that would be covered in the scrubbing process you mentioned in
the other/regression thread.



@@ -10312,6 +10315,7 @@ lpfc_pci_probe_one_s3(struct pci_dev *pdev, const 
struct pci_device_id *pid)
}

/* Initialize and populate the iocb list per host */
+
error = lpfc_init_iocb_list(phba, LPFC_IOCB_LIST_CNT);
if (error) {
lpfc_printf_log(phba, KERN_ERR, LOG_INIT,
@@ -11092,7 +11096,6 @@ lpfc_pci_probe_one_s4(struct pci_dev *pdev, const 
struct pci_device_id *pid)
}

/* Initialize and populate the iocb list per host */
-
lpfc_printf_log(phba, KERN_INFO, LOG_INIT,
"2821 initialize iocb list %d.\n",
phba->cfg_iocb_cnt*1024);



cheers,


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Mauricio Faria de Oliveira

Hi Dan,

Thanks for reviewing.

On 04/04/2017 06:07 PM, Dan Williams wrote:

@@ -594,6 +594,10 @@ static ssize_t get_component_power_status(struct device 
*cdev,

if (edev->cb->get_power_status)
edev->cb->get_power_status(edev, ecomp);
+
+   if (ecomp->power_status == -1)
+   return -EINVAL;
+

Can we ever hit this if get_power_status is non-null?


I admit that is cautionary, and more for consistency with the chance
of an uninitialized state being still in place, so not to return any
(incorrect) value in that state.

But, yes, in a few cases -- although unlikely.

1) imagine .get_power_status couldn't update the 'power_status' field
   (it's a bit unlikely with the in-tree ses driver, but in the case
that ses_get_page2_descriptor() returns NULL, ses_get_power_status()
doesn't update 'power_status').

2) an out-of-tree module employs another enclosure_component_callbacks
   structure, which doesn't implement a .get_power_status hook.

3) or it does, but that failed, and didn't update 'power_status' (e.g.,
   like case 1)



+static bool power_status_on_probe; /* default: false, 0. */
+module_param(power_status_on_probe, bool, 0644);
+MODULE_PARM_DESC(power_status_on_probe, "get power status of SES device slots "
+   "(enclosure components) at probe time "
+   "(warning: may delay total probe 
time)");
+

I don't think we need this module parameter as long as the only way to
read the power status is through sysfs. We can always just leave it to
be read on-demand when needed.


I agree for the in-tree module.  My only concern, as described in the
commit message, is the case someone is using an out-of-tree module /
has an implementation that depends on that (e.g., reading the 'power_
status' field directly, instead of using .get_power_status.

I've been a bit overzealous when considering why that call was inserted
in ses_enclosure_data_process(), and didn't want to break them.

However, for an in-tree usage, it's clear that just dropping that line
seemed to be sufficient -- the rest in the patch is just consideration
for whoever might be using it in out-of-tree ways.

(and if such consideration is deemed not to be required in this case,
 I'd be fine with dropping the related changes and making this patch
 a one-line. :- )

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [REGRESSION] v4.11-rc3: lpfc: panic during module removal / shutdown

2017-04-04 Thread Mauricio Faria de Oliveira

Hi Martin and Junichi,

On 04/03/2017 11:10 PM, Junichi Nomura wrote:

On 04/04/17 06:53, Mauricio Faria de Oliveira wrote:



On 03/28/2017 11:29 PM, Junichi Nomura wrote:

Since commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications"),
"rmmod lpfc" starting to cause panic or corruption due to double free.



Thanks for the report. Can you please check whether the patch just sent
([PATCH] lpfc: fix double free of bound CQ/WQ ring pointer) resolves it?



It works for me. Thank you!


Excellent, thanks!

Martin, can you review/consider it for 4.11-rc6, please?


Considering future maintenance, it might be a bit fragile to just depend
on the code comment about representing the relation between cq/wq and
shared pring but it's maintainers' choice.


I agree -- there should be a better way of identifying a bound WQ/CQ.
Perhaps there is, but I couldn't find it currently.

For now, as far as I could grep and examine the code (detailed in commit
message), a WQ is always bound to a CQ, so to check for WQ and not free
its ring pointer seems to be sufficient (as the CQ ring pointer is freed
first).

If that changes, probably some form of flagging and/or queue type
determination would be better/necessary.

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [REGRESSION] v4.11-rc3: lpfc: panic during module removal / shutdown

2017-04-03 Thread Mauricio Faria de Oliveira

Hi Junichi,

On 03/28/2017 11:29 PM, Junichi Nomura wrote:

Since commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications"),
"rmmod lpfc" starting to cause panic or corruption due to double free.


Thanks for the report. Can you please check whether the patch just sent
([PATCH] lpfc: fix double free of bound CQ/WQ ring pointer) resolves it?

I don't have a setup to test it handy right now.

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



[PATCH] lpfc: fix double free of bound CQ/WQ ring pointer

2017-04-03 Thread Mauricio Faria de Oliveira
commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications")
binds the CQs and WQs ring pointer (sets it to same address on both).

lpfc_create_wq_cq():
...
rc = lpfc_cq_create(phba, cq, eq, <...>)
...
rc = lpfc_wq_create(phba, wq, cq, qtype);
...
/* Bind this CQ/WQ to the NVME ring */
pring = wq->pring;
...
cq->pring = pring;
...

The commit frees both CQ & WQ for FCP/NVME on lpfc_sli4_queue_destroy(),
which causes a double free (potential corruption or panic) on freeing
the ring pointer of the second entity (CQ is first, WQ is second):

lpfc_pci_remove_one() # that is, .remove / .shutdown
-> lpfc_pci_remove_one_s4()
   -> lpfc_sli4_hba_unset()
  -> lpfc_sli4_queue_destroy()

 -> lpfc_sli4_release_queues()  # Release FCP/NVME cqs
-> __lpfc_sli4_release_queue()
   -> lpfc_sli4_queue_free()
  -> kfree(queue->pring)  # first free

 -> lpfc_sli4_release_queues()  # Release FCP/NVME wqs
-> __lpfc_sli4_release_queue()
   -> lpfc_sli4_queue_free()
  -> kfree(queue->pring)  # second free

So, check for WQs in lpfc_sli4_queue_free() and do not free the pring,
as it is freed before in the bound CQ.  [the WQs are created only via
lpfc_wq_create(), which sets struct lpfc_queue::type == LPFC_WQ.  And
that happens in 2 sites (lpfc_create_wq_cq() & lpfc_fof_queue_setup()),
both of which bind the CQs & WQs. Thus, checking for the LPFC_WQ type
correlates to whether the WQ is bound to a CQ, which is freed first.]

Additional details:

For reference, that binding also occurs on one other function:

lpfc_fof_queue_setup():
...
rc = lpfc_cq_create(phba, phba->sli4_hba.oas_cq, <...>)
...
rc = lpfc_wq_create(phba, phba->sli4_hba.oas_wq, <...>)
...
/* Bind this CQ/WQ to the NVME ring */
pring = phba->sli4_hba.oas_wq->pring;
...
phba->sli4_hba.oas_cq->pring = pring;

And used to occur similarly on lpfc_sli4_queue_setup(), but was changed
by that commit; although the problem is more related to the new freeing
pattern introduced in lpfc_sli4_queue_destroy() plus the bound CQs/WQs.

-   /* Bind this WQ to the next FCP ring */
-   pring = >ring[MAX_SLI3_CONFIGURED_RINGS + fcp_wqidx];
...
-   phba->sli4_hba.fcp_cq[fcp_wqidx]->pring = pring;

commit 85e8a23936ab ("scsi: lpfc: Add shutdown method for kexec") made
this more likely as lpfc_pci_remove_one() is called on driver shutdown
(e.g., modprobe -r / rmmod).

(this patch is partially based on a different patch suggested by Johannes,
 thus adding a Suggested-by tag for due credit.)

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Reported-by: Junichi Nomura <j-nom...@ce.jp.nec.com>
Suggested-by: Johannes Thumshirn <jthumsh...@suse.de>
---
 drivers/scsi/lpfc/lpfc_sli.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index 1c9fa45df7eb..8befe841adaa 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -13758,7 +13758,14 @@ void lpfc_sli4_els_xri_abort_event_proc(struct 
lpfc_hba *phba)
lpfc_free_rq_buffer(queue->phba, queue);
kfree(queue->rqbp);
}
-   kfree(queue->pring);
+
+   /*
+* The WQs/CQs' pring is bound (same pointer).
+* So free it only once, and not again on WQ.
+*/
+   if (queue->type != LPFC_WQ)
+   kfree(queue->pring);
+
kfree(queue);
return;
 }
-- 
1.8.3.1



[PATCH] scsi: ses: don't get power status of SES device slot on probe

2017-03-29 Thread Mauricio Faria de Oliveira
The commit 08024885a2a3 ("ses: Add power_status to SES device slot")
introduced the 'power_status' attribute to enclosure components and
the associated callbacks.

There are 2 callbacks available to get the power status of a device:
1) ses_get_power_status() for 'struct enclosure_component_callbacks'
2) get_component_power_status() for the sysfs device attribute
(these are available for kernel-space and user-space, respectively.)

However, despite both methods being available to get power status
on demand, that commit also introduced a call to get power status
in ses_enclosure_data_process().

This dramatically increased the total probe time for SCSI devices
on larger configurations, because ses_enclosure_data_process() is
called several times during the SCSI devices probe and loops over
the component devices (but that is another problem, another patch).

That results in a tremendous continuous hammering of SCSI Receive
Diagnostics commands to the enclosure-services device, which does
delay the total probe time for the SCSI devices __significantly__:

  Originally, ~34 minutes on a system attached to ~170 disks:

[ 9214.490703] mpt3sas version 13.100.00.00 loaded
...
[11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0),
   ordered(0), scsi_level(6), cmd_que(1)

  With this patch, it decreased to ~2.5 minutes -- a 13.6x faster

[ 1002.992533] mpt3sas version 13.100.00.00 loaded
...
[ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0),
   ordered(0), scsi_level(6), cmd_que(1)

Another downside is that the SCSI device class lock is held during
the executions of ses_enclosure_data_process() since it's a callee
of ses_intf_add(), which is executed under that lock in device_add().
This ends up blocking the parallel SCSI scan functionality because
device_add() depends on that lock, and also the removal of devices
via device_del() (e.g., during an enclosure reboot).

Back to the commit discussion.. on the ses_get_power_status() call
introduced in ses_enclosure_data_process(): impact of removing it.

That may possibly be in place to initialize the power status value
on device probe.  However, those 2 functions available to retrieve
that value _do_ automatically refresh/update it.  So the potential
benefit would be a direct access of the 'power_status' field which
does not use the callbacks...

But the only reader of 'struct enclosure_component::power_status'
is the get_component_power_status() callback for sysfs attribute,
and it _does_ check for and call the .get_power_status callback,
(which indeed is defined and implemented by that commit), so the
power status value is, again, automatically updated.

So, the remaining potential for a direct/non-callback access to
the power_status attribute would be out-of-tree modules -- well,
for those, if they are for whatever reason interested in values
that are set during device probe and not up-to-date by the time
they need it.. well, that would be curious.

So, for out-of-tree modules and people still interested in this
behavior (in case I might have missed something about it, other
than the total probe time impact): let's add a module parameter
to keep that check turned on (disabled by default).

Also, to handle that more properly, set the initial power state
value to '-1' (i.e., uninitialized) instead of '1' (power 'on'),
and check for it in that callback which may do an direct access
to the field value _if_ a callback function is not defined.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot")
---
 drivers/misc/enclosure.c | 6 +-
 drivers/scsi/ses.c   | 9 -
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed7146e9b..d3a8a13c4247 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -148,7 +148,7 @@ struct enclosure_device *
for (i = 0; i < components; i++) {
edev->component[i].number = -1;
edev->component[i].slot = -1;
-   edev->component[i].power_status = 1;
+   edev->component[i].power_status = -1;
}
 
mutex_lock(_list_lock);
@@ -594,6 +594,10 @@ static ssize_t get_component_power_status(struct device 
*cdev,
 
if (edev->cb->get_power_status)
edev->cb->get_power_status(edev, ecomp);
+
+   if (ecomp->power_status == -1)
+   return -EINVAL;
+
return snprintf(buf, 40, "%s\n", ecomp->power_status ? "on" : "off");
 }
 
diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
index 50adabbb5808..2fafefd419c1 100644
--- a/drivers/scsi/ses.c
+++ b/drivers/scsi/ses.c
@@ -36,6 +36,12 @@
 
 #include 
 
+static bool power_status_on_probe; /* default: false, 0. */
+module_param(power_status

Re: [PATCHv3 1/6] scsi_error: count medium access timeout only once per EH run

2017-03-13 Thread Mauricio Faria de Oliveira

On 03/13/2017 11:48 AM, Hannes Reinecke wrote:

This is assuming that we're always running on a scsi_disk, and that
scsi_disk is the only one implementing 'eh_action'.
Neither of which is necessarily true.


Ah, OK. Thanks for explaining.


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCHv3 1/6] scsi_error: count medium access timeout only once per EH run

2017-03-13 Thread Mauricio Faria de Oliveira
on access commands like
  * test unit ready (so wrongly see the device as having a successful
- * recovery)
+ * recovery).
+ * We have to be careful to count a medium access failure only once
+ * per SCSI EH run; there might be several timed out commands which
+ * will cause the 'max_medium_access_timeouts' counter to trigger
+ * after the first SCSI EH run already and set the device to offline.
  **/
@@ -1714,7 +1724,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, 
int eh_disp)

 * process of recovering or has it suffered an internal failure
 * that prevents access to the storage medium.
 */
-   sdkp->medium_access_timed_out++;
+   if (!sdkp->medium_access_reset) {
+   sdkp->medium_access_timed_out++;
+   sdkp->medium_access_reset = 1;
+   }

/*
 * If the device keeps failing read/write commands but TEST UNIT
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 4dac35e..6a4f75a 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -106,6 +106,7 @@ struct scsi_disk {
unsignedrc_basis: 2;
unsignedzoned: 2;
unsignedurswrz : 1;
+   unsignedmedium_access_reset : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)











--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH v3 01/16] lpfc: Correct WQ creation for pagesize

2017-03-07 Thread Mauricio Faria de Oliveira

Hello stable kernel maintainers,

On 03/07/2017 12:35 AM, Martin K. Petersen wrote:

Mauricio> Please flag this patch for stable.

Mauricio> This patch resolves a serious problem on IBM Power systems at
Mauricio> least.

Both patches are already upstream so I can't tag them for stable. Either
you or James should mail sta...@vger.kernel.org and request for the
patches to be queued up.


Can this commit be included on stable v4.4+ , please? (in Linus tree)

8ea73db486cda442f0671f4bc9c03a76be398a28 "scsi: lpfc: Correct WQ 
creation for pagesize"


Thanks!


For documentation purposes, the resolved issue is described on [1].

Following is the rationale for the stable kernel version asked for, in
case Broadcom/others would like to discuss/contribute.

These are the 3 commits which introduced the lines changed by the 
desired fix commit


0c651878ba3018bb4bbfa2ccd0a876bebb618768 "[SCSI] lpfc 8.3.41: Fixed 
support for 128 byte WQEs"
5a6f133eea2d0b4f8f75367b803fef0f03acf268 "[SCSI] lpfc 8.3.22: Add 
new mailbox command and new BSG fix"
c31098cef5e091e22a02ff255f911e0ad71cc393 "[SCSI] lpfc 8.3.23: Fixes 
related to new hardware"


So, checking the oldest release git tag which contains the most recent
commit (so the fix applies cleanly)

$ git tag --contains 0c651878ba3018bb4bbfa2ccd0a876bebb618768 | 
grep -v rc | sort -V

v3.12
<...>
v4.10

And it is cherry-picked/applies cleanly on v3.12:

$ git checkout -b lpfc-v3.12 v3.12
Switched to a new branch 'lpfc-v3.12'

$ git cherry-pick 8ea73db486cda442f0671f4bc9c03a76be398a28
[lpfc-v3.12 c2b0912e6135] scsi: lpfc: Correct WQ creation for pagesize
Author: James Smart <jsmart2...@gmail.com>
2 files changed, 7 insertions(+), 4 deletions(-)

However, v3.12 seems too old to go back, and it's unclear whether there
was any problem actually reported on these older kernel releases, or
whether the hardware/configuration that reproduces this problem is
supported on v3.12 at all.

On PowerPC (which is directly affected by the fix, as at least IBM
Power servers uses a 64k page size by default on most distros), this
fix is probably only required on v4.4+, so to cover a distro for which
Broadcom doesn't ship inbox drivers to (w/ fix included).

So, I'll ask it for stable v4.4+, and Broadcom/others can expand the
range if they see fit/required.

[1] http://www.spinics.net/lists/linux-scsi/msg105886.html

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-03-07 Thread Mauricio Faria de Oliveira

Martin,

On 03/07/2017 02:24 AM, Benjamin Herrenschmidt wrote:


On Mon, 2017-03-06 at 22:46 -0500, Martin K. Petersen wrote:

I don't recall a consensus being reached on this patch.



What would be the opposition ? Without it kexec breaks. With it, it
works ...


That is the argument I'd present/ask for consideration too.

I think I should have included this in the tested-by tag email, for
documentation/evidence: no regression observed in system shutdown path.

Thanks,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH v3 01/16] lpfc: Correct WQ creation for pagesize

2017-03-07 Thread Mauricio Faria de Oliveira

Hi Martin,

On 03/07/2017 12:35 AM, Martin K. Petersen wrote:

Mauricio> Please flag this patch for stable.



Both patches are already upstream so I can't tag them for stable. Either
you or James should mail sta...@vger.kernel.org and request for the
patches to be queued up.


Right, sorry; I missed checking the right tree. Thanks for the pointers.


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-03-06 Thread Mauricio Faria de Oliveira

On 02/12/2017 07:49 PM, Anton Blanchard wrote:

We see lpfc devices regularly fail during kexec. Fix this by adding
a shutdown method which mirrors the remove method.


Reviewed-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Tested-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>

@mkp, @jejb: may you please flag this patch for stable?  Thank you.

This patch fixes a problem of some adapter ports going offline after
reboot on the following adapter, at least:

[10df:f100] Saturn-X: LightPulse Fibre Channel Host Adapter

# dmesg | grep lpfc
...
[   61.676728] lpfc 0003:12:00.0: 6:0442 Adapter failed to init,
mbxCmd x88 CONFIG_PORT, mbxStatus x0 Data: x0
[   92.766717] lpfc 0003:12:00.0: 6:0442 Adapter failed to init,
mbxCmd x88 CONFIG_PORT, mbxStatus x0 Data: x0
[  123.856717] lpfc 0003:12:00.0: 6:0442 Adapter failed to init,
mbxCmd x88 CONFIG_PORT, mbxStatus x0 Data: x0
...

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH v3 01/16] lpfc: Correct WQ creation for pagesize

2017-03-03 Thread Mauricio Faria de Oliveira

Hi Martin and James,

On 02/12/2017 07:52 PM, James Smart wrote:

Correct WQ creation for pagesize


Reviewed-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>


Please flag this patch for stable.

This patch resolves a serious problem on IBM Power systems at least.

An (apparently constant) series of invalid iotags cause a series of
SCSI command aborts that escalates into host reset, which fails too, 
bringing the adapter offline.


And it takes forever to boot, as udev waits on the initial SCSI I/O.

Thank you.


[   17.044690] lpfc :01:00.1: 1:0372 iotag x0 is out off range: max 
iotag (x840)

...
[  723.680700] lpfc :01:00.1: 1:(0):0714 SCSI layer issued Bus Reset 
Data: x2003
[  723.680873] lpfc :01:00.1: 1:(0):3172 SCSI layer issued Host 
Reset Data:
[  724.052506] lpfc :01:00.1: 1:0383 Error -5 during scsi sgl post 
operation
[  724.052701] lpfc :01:00.1: 1:(0):3323 Failed host reset, bring it 
offline



--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH v2 10/11] lpfc: Add missing memory barrier

2017-03-03 Thread Mauricio Faria de Oliveira

On 12/19/2016 09:07 PM, James Smart wrote:

On loosely ordered memory systems (PPC for example), the WQE elements
were being updated in memory, but not necessarily flushed before the
separate doorbell was written to hw which would cause hw to dma the
WQE element. Thus, the hardware occasionally received partially
updated WQE data.

Add the memory barrier after updating the WQE memory.


Reviewed-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>

Martin, may you please flag this patch for stable?

Thank you,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH 2/2] qla2xxx: Avoid that issuing a LIP triggers a kernel crash

2017-01-25 Thread Mauricio Faria de Oliveira

Hi Martin,

On 01/25/2017 09:29 PM, Martin K. Petersen wrote:

Please do a proper patch submission for this fix.


Okay, I submitted a v2 patch w/ the suggested change.

However, the original patch has been submitted by Bart,
so I believe credit is due, but not sure how to handle
this case.

Thus, please feel free to change the sign-off line as
appropriate here.

Thanks,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] qla2xxx: Avoid that issuing a LIP triggers a kernel crash

2017-01-25 Thread Mauricio Faria de Oliveira
Avoid that issuing a LIP as follows:

  find /sys -name 'issue_lip'|while read f; do echo 1 > $f; done

triggers the following:

BUG: unable to handle kernel NULL pointer dereference at (null)
Call Trace:
 qla2x00_abort_all_cmds+0xed/0x140 [qla2xxx]
 qla2x00_abort_isp_cleanup+0x1e3/0x280 [qla2xxx]
 qla2x00_abort_isp+0xef/0x690 [qla2xxx]
 qla2x00_do_dpc+0x36c/0x880 [qla2xxx]
 kthread+0x10c/0x140

Note: this patch is a slight change of the original patch
  sent by Bart, submitted by request of mkp.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Reported-by: Bart Van Assche <bart.vanass...@sandisk.com>
Fixes: 1535aa75a3d8 ("qla2xxx: fix invalid DMA access after command aborts in 
PCI device remove")
Cc: Himanshu Madhani <himanshu.madh...@cavium.com>
Cc: <sta...@vger.kernel.org>
---
 drivers/scsi/qla2xxx/qla_os.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 0a000ecf0881..a17cb63b3fd5 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1600,6 +1600,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
srb_t *sp;
struct qla_hw_data *ha = vha->hw;
struct req_que *req;
+   struct scsi_cmnd *scmd;
 
qlt_host_reset_handler(ha);
 
@@ -1613,10 +1614,12 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) {
sp = req->outstanding_cmds[cnt];
if (sp) {
+   scmd = GET_CMD_SP(sp);
+
/* Don't abort commands in adapter during EEH
 * recovery as it's not accessible/responding.
 */
-   if (!ha->flags.eeh_busy) {
+   if (scmd && !ha->flags.eeh_busy) {
/* Get a reference to the sp and drop 
the lock.
 * The reference ensures this 
sp->done() call
 * - and not the call in 
qla2xxx_eh_abort() -
@@ -1624,7 +1627,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
 */
sp_get(sp);

spin_unlock_irqrestore(>hardware_lock, flags);
-   qla2xxx_eh_abort(GET_CMD_SP(sp));
+   qla2xxx_eh_abort(scmd);
spin_lock_irqsave(>hardware_lock, 
flags);
}
req->outstanding_cmds[cnt] = NULL;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] qla2xxx: Avoid that issuing a LIP triggers a kernel crash

2017-01-24 Thread Mauricio Faria de Oliveira

Hi Bart,

First of all, sorry for the new bug; I didn't realize the pointer could
be NULL at this scenario.

On 01/23/2017 02:34 PM, Bart Van Assche wrote:

@@ -1624,7 +1627,8 @@ qla2x00_abort_all_cmds(scsi_qla_host_t *vha, int res)
 */
sp_get(sp);

spin_unlock_irqrestore(>hardware_lock, flags);
-   qla2xxx_eh_abort(GET_CMD_SP(sp));
+   if (scmd)
+   qla2xxx_eh_abort(scmd);
spin_lock_irqsave(>hardware_lock, 
flags);
}


Now, this chunk has a problem with reference counting (and unnecessary
spin-locking), which we can avoid by simply moving up this NULL check.

The call to sp_get() increments the sp->ref_count, but if you skip the
call to qla2xxx_eh_abort() you don't get the decrement from the call to
sp->done() at abort handling from ISR, e.g., qla24xx_abort_iocb_entry().
[or if the command completed successfully between issue/complete abort,
at the completion from ISR, e.g., qla2x00_process_completed_request().]

The sp->done() call just below this chunk was supposed to drop the
initial reference [set at qla2xxx_queuecommand()] at a time we did
not call qla2xxx_eh_abort() yet... but now that we __may__ call it
(and get that sp->done() call from the ISR abort handling), we need
to only increment it if we're going to drop it.

That should be resolved with this slight change to your patch
(which also helps w/ the spin-locking).  What do you/others think?

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 0a000ecf0881..a17cb63b3fd5 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1600,6 +1600,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
srb_t *sp;
struct qla_hw_data *ha = vha->hw;
struct req_que *req;
+   struct scsi_cmnd *scmd;

qlt_host_reset_handler(ha);

@@ -1613,10 +1614,12 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data 
*ha)

for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) {
sp = req->outstanding_cmds[cnt];
if (sp) {
+   scmd = GET_CMD_SP(sp);
+
/* Don't abort commands in adapter 
during EEH
 * recovery as it's not 
accessible/responding.

 */
-   if (!ha->flags.eeh_busy) {
+   if (scmd && !ha->flags.eeh_busy) {
/* Get a reference to the sp 
and drop the lock.
 * The reference ensures this 
sp->done() call
 * - and not the call in 
qla2xxx_eh_abort() -

@@ -1624,7 +1627,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
 */
sp_get(sp);

spin_unlock_irqrestore(>hardware_lock, flags);
-   qla2xxx_eh_abort(GET_CMD_SP(sp));
+   qla2xxx_eh_abort(scmd);

spin_lock_irqsave(>hardware_lock, flags);
}
    req->outstanding_cmds[cnt] = NULL;


Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>


--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi: do not requeue requests unaligned with device sector size

2016-12-21 Thread Mauricio Faria de Oliveira

On 12/21/2016 05:50 AM, Christoph Hellwig wrote:
> How do you even get an unaligned residual count?  Except for SES
> processor devices (which will only issue BLOCK_PC commands) this is
> not allowed by SPC:
>
> "The residual count shall be reported in bytes if the peripheral device
>  type in the destination target descriptor is 03h (i.e., processor 
device),

>  and in destination device blocks for all other device type codes.

On 12/21/2016 06:09 AM, Hannes Reinecke wrote:
> Which actually would be pretty much my objection, too.
>
> This would only be applicable for 512e drives, where we _might_ end up
> with a residual smaller than the physical sector size.
> But that should be handled by firmware; after all, that's what the 'e'
> implies, right?

On 12/21/2016 12:01 PM, Martin K. Petersen wrote:

I agree with Christoph and Hannes. Some of this falls into the gray area
that's outside of the T10 spec (HBA programming interface guarantees)
but it seems like a deficiency in the HBA to report a byte count that's
not a multiple of the logical block size. A block can't be partially
written. Either it made it or it didn't. Regardless of how the I/O is
being broken up into frames at the transport level and at which offset
the transfer was interrupted.


Christoph, Hannes, Martin,

Thank you all for your comments and pointers to the documentation/spec.
I'll carry it on with the HBA and storage folks.

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] scsi: do not requeue requests unaligned with device sector size

2016-12-19 Thread Mauricio Faria de Oliveira
if=/dev/sda of=/dev/null bs=4096 count=1
  skip=16909064 iflag=direct 2>/dev/null; dmesg
[...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272512, count=8
[...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
[...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
[...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
  (sector size 4096, remainder 3072, resid 1024)
[...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
[...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
[...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
  (sector size 4096, remainder 3072, resid 1024)
[...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
[...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
[...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
  (sector size 4096, remainder 3072, resid 1024)
[...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
[...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 4096 bytes done.
[...] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done.

Apologies for the ridiculously long commit message with description and
test-cases, but this problem has been relatively difficult to reproduce
and understand, so I thought the documentation/instructions for working
with that could be helpful for others too.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/scsi_lib.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2cca9cffc63f..190ab28cfb2d 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -821,6 +821,45 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
}
 
/*
+* If the SCSI command is successful but has residual bytes,
+* align good_bytes to the device sector size to ensure the
+* requeued request (to complete the remainder transfer) is
+* also aligned and does not fail alignment/size validation.
+*/
+   if (unlikely(!result && scsi_get_resid(cmd) &&
+req->cmd_type == REQ_TYPE_FS)) {
+
+   unsigned int sector_size = cmd->device->sector_size;
+   unsigned int remainder = good_bytes % sector_size;
+   int resid = scsi_get_resid(cmd);
+
+   SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, cmd,
+   "checking %d bytes for alignment "
+   "(sector size %u, remainder %u, resid %d)\n",
+   good_bytes, sector_size, remainder, resid));
+
+   if (remainder) {
+   good_bytes -= remainder;
+   resid += remainder;
+   scsi_set_resid(cmd, resid);
+
+   /*
+* If less than one device sector has completed, retry 
the
+* request after delay (up to the retry limit) or 
timeout.
+*/
+   if (!good_bytes) {
+   if ((++cmd->retries) <= cmd->allowed
+   && !scsi_noretry_cmd(cmd)) {
+   goto delayed_retry;
+   } else {
+   set_host_byte(cmd, DID_TIME_OUT);
+   goto error;
+   }
+   }
+   }
+   }
+
+   /*
 * special case: failed zero length commands always need to
 * drop down into the retry code. Otherwise, if we finished
 * all bytes in the request we are done now.
@@ -845,6 +884,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int 
good_bytes)
if (result == 0)
goto requeue;
 
+   error:
error = __scsi_error_from_host_byte(cmd, result);
 
if (host_byte(result) == DID_RESET) {
@@ -985,6 +1025,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
__scsi_queue_insert(cmd, SCSI_MLQUEUE_EH_RETRY, 0);
break;
case ACTION_DELAYED_RETRY:
+   delayed_retry:
/* Retry the same command after a delay */
__scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY, 0);
break;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RESEND] block: allow WRITE_SAME commands with the SG_IO ioctl

2016-12-15 Thread Mauricio Faria de Oliveira
The WRITE_SAME commands are not present in the blk_default_cmd_filter
write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
[ sg_io() -> blk_fill_sghdr_rq() > blk_verify_command() -> -EPERM ]

The problem can be reproduced with the sg_write_same command

  # sg_write_same --num 1 --xferlen 512 /dev/sda
  #

  # capsh --drop=cap_sys_rawio -- -c \
'sg_write_same --num 1 --xferlen 512 /dev/sda'
Write same: pass through os error: Operation not permitted
  #

For comparison, the WRITE_VERIFY command does not observe this problem,
since it is in that list:

  # capsh --drop=cap_sys_rawio -- -c \
'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
  #

So, this patch adds the WRITE_SAME commands to the list, in order
for the SG_IO ioctl to finish successfully:

  # capsh --drop=cap_sys_rawio -- -c \
'sg_write_same --num 1 --xferlen 512 /dev/sda'
  #

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "" [2]),
which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).

In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
which are translated to write-same calls in the guest kernel, and then into
SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

  [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
  [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
  [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
  [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 
00
  [...] blk_update_request: I/O error, dev sda, sector 17096824

Links:
[1] 
http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
Signed-off-by: Brahadambal Srinivasan <la...@linux.vnet.ibm.com>
Reported-by: Manjunatha H R <manju...@in.ibm.com>
---
 block/scsi_ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 0774799..c6fee74 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -182,6 +182,9 @@ static void blk_set_cmd_filter_defaults(struct 
blk_cmd_filter *filter)
__set_bit(WRITE_16, filter->write_ok);
__set_bit(WRITE_LONG, filter->write_ok);
__set_bit(WRITE_LONG_2, filter->write_ok);
+   __set_bit(WRITE_SAME, filter->write_ok);
+   __set_bit(WRITE_SAME_16, filter->write_ok);
+   __set_bit(WRITE_SAME_32, filter->write_ok);
__set_bit(ERASE, filter->write_ok);
__set_bit(GPCMD_MODE_SELECT_10, filter->write_ok);
__set_bit(MODE_SELECT, filter->write_ok);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()

2016-11-23 Thread Mauricio Faria de Oliveira

On 11/23/2016 12:12 PM, Johannes Thumshirn wrote:

Looks good and sorry for the bug,
Reviewed-by: Johannes Thumshirn <jthumsh...@suse.de>


Thanks for the quick review. Not a problem!
This problem turned out to be a good learning exercise. :)


--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()

2016-11-23 Thread Mauricio Faria de Oliveira

Due credit; an oversight.

On 11/23/2016 10:33 AM, Mauricio Faria de Oliveira wrote:

Reported-by: Harsha Thyagaraja <hathy...@in.ibm.com>


Cc: sta...@vger.kernel.org # v4.8
Fixes: 22466da5b4b7 ("lpfc: Fix possible NULL pointer dereference")
Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>




--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()

2016-11-23 Thread Mauricio Faria de Oliveira
The BUG_ON() recently introduced in lpfc_sli_ringtxcmpl_put()
is hit in the lpfc_els_abort() > lpfc_sli_issue_abort_iotag()
 > lpfc_sli_abort_iotag_issue() function path [similar names],
due to 'piocb->vport == NULL':

BUG_ON(!piocb || !piocb->vport);

This happens because lpfc_sli_abort_iotag_issue() doesn't set
the 'abtsiocbp->vport' pointer -- but this is not the problem.

Previously, lpfc_sli_ringtxcmpl_put() accessed 'piocb->vport'
only if 'piocb->iocb.ulpCommand' is neither CMD_ABORT_XRI_CN
nor CMD_CLOSE_XRI_CN, which are the only possible values for
lpfc_sli_abort_iotag_issue():

lpfc_sli_ringtxcmpl_put():

if ((unlikely(pring->ringno == LPFC_ELS_RING)) &&
   (piocb->iocb.ulpCommand != CMD_ABORT_XRI_CN) &&
   (piocb->iocb.ulpCommand != CMD_CLOSE_XRI_CN) &&
(!(piocb->vport->load_flag & FC_UNLOADING)))

lpfc_sli_abort_iotag_issue():

if (phba->link_state >= LPFC_LINK_UP)
iabt->ulpCommand = CMD_ABORT_XRI_CN;
else
iabt->ulpCommand = CMD_CLOSE_XRI_CN;

So, this function path would not have hit this possible NULL
pointer dereference before.

In order to fix this regression, move the second part of the
BUG_ON() check prior to the pointer dereference that it does
check for.

For reference, this is the stack trace observed. The problem
happened because an unsolicited event was received - a PLOGI
was received after our PLOGI was issued but not yet complete,
so the discovery state machine goes on to sw-abort our PLOGI.

kernel BUG at drivers/scsi/lpfc/lpfc_sli.c:1326!
Oops: Exception in kernel mode, sig: 5 [#1]
<...>
NIP [...] lpfc_sli_ringtxcmpl_put+0x1c/0xf0 [lpfc]
LR  [...] __lpfc_sli_issue_iocb_s4+0x188/0x200 [lpfc]
Call Trace:
[...] [...] __lpfc_sli_issue_iocb_s4+0xb0/0x200 [lpfc] (unreliable)
[...] [...] lpfc_sli_issue_abort_iotag+0x2b4/0x350 [lpfc]
[...] [...] lpfc_els_abort+0x1a8/0x4a0 [lpfc]
[...] [...] lpfc_rcv_plogi+0x6d4/0x700 [lpfc]
[...] [...] lpfc_rcv_plogi_plogi_issue+0xd8/0x1d0 [lpfc]
[...] [...] lpfc_disc_state_machine+0xc0/0x2b0 [lpfc]
[...] [...] lpfc_els_unsol_buffer+0xcc0/0x26c0 [lpfc]
[...] [...] lpfc_els_unsol_event+0xa8/0x220 [lpfc]
[...] [...] lpfc_complete_unsol_iocb+0xb8/0x138 [lpfc]
[...] [...] lpfc_sli4_handle_received_buffer+0x6a0/0xec0 [lpfc]
[...] [...] lpfc_sli_handle_slow_ring_event_s4+0x1c4/0x240 [lpfc]
[...] [...] lpfc_sli_handle_slow_ring_event+0x24/0x40 [lpfc]
[...] [...] lpfc_do_work+0xd88/0x1970 [lpfc]
[...] [...] kthread+0x108/0x130
[...] [...] ret_from_kernel_thread+0x5c/0xbc
<...>

Cc: sta...@vger.kernel.org # v4.8
Fixes: 22466da5b4b7 ("lpfc: Fix possible NULL pointer dereference")
Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/lpfc/lpfc_sli.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index c5326055beee..f4f77c5b0c83 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -1323,18 +1323,20 @@ struct lpfc_iocbq *
 {
lockdep_assert_held(>hbalock);
 
-   BUG_ON(!piocb || !piocb->vport);
+   BUG_ON(!piocb);
 
list_add_tail(>list, >txcmplq);
piocb->iocb_flag |= LPFC_IO_ON_TXCMPLQ;
 
if ((unlikely(pring->ringno == LPFC_ELS_RING)) &&
   (piocb->iocb.ulpCommand != CMD_ABORT_XRI_CN) &&
-  (piocb->iocb.ulpCommand != CMD_CLOSE_XRI_CN) &&
-   (!(piocb->vport->load_flag & FC_UNLOADING)))
-   mod_timer(>vport->els_tmofunc,
- jiffies +
- msecs_to_jiffies(1000 * (phba->fc_ratov << 1)));
+  (piocb->iocb.ulpCommand != CMD_CLOSE_XRI_CN)) {
+   BUG_ON(!piocb->vport);
+   if (!(piocb->vport->load_flag & FC_UNLOADING))
+   mod_timer(>vport->els_tmofunc,
+ jiffies +
+ msecs_to_jiffies(1000 * (phba->fc_ratov << 
1)));
+   }
 
return 0;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] qla2xxx: do not abort all commands in the adapter during EEH recovery

2016-11-14 Thread Mauricio Faria de Oliveira
The previous commit ("qla2xxx: fix invalid DMA access after command
aborts in PCI device remove") introduced a regression during an EEH
recovery, since the change to the qla2x00_abort_all_cmds() function
calls qla2xxx_eh_abort(), which verifies the EEH recovery condition
but handles it heavy-handed. (commit a465537ad1a4 "qla2xxx: Disable
the adapter and skip error recovery in case of register disconnect.")

This problem warrants a more general/optimistic solution right into
qla2xxx_eh_abort()  (eg in case a real command abort arrives during
EEH recovery, or if it takes long enough to trigger command aborts);
but it's still worth to add a check to ensure the code added by the
previous commit is correct and contained within its owner function.

This commit just adds a 'if (!ha->flags.eeh_busy)' check around it.
(ahem; a trivial fix for this -rc series; sorry for this oversight.)

With it applied, both PCI device remove and EEH recovery works fine.

Fixes: 1535aa75a3d8 ("scsi: qla2xxx: fix invalid DMA access after
command aborts in PCI device remove")
Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/qla2xxx/qla_os.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 567fa080e261..56d6142852a5 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1456,15 +1456,20 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) {
sp = req->outstanding_cmds[cnt];
if (sp) {
-   /* Get a reference to the sp and drop the lock.
-* The reference ensures this sp->done() call
-* - and not the call in qla2xxx_eh_abort() -
-* ends the SCSI command (with result 'res').
+   /* Don't abort commands in adapter during EEH
+* recovery as it's not accessible/responding.
 */
-   sp_get(sp);
-   spin_unlock_irqrestore(>hardware_lock, 
flags);
-   qla2xxx_eh_abort(GET_CMD_SP(sp));
-   spin_lock_irqsave(>hardware_lock, flags);
+   if (!ha->flags.eeh_busy) {
+   /* Get a reference to the sp and drop 
the lock.
+* The reference ensures this 
sp->done() call
+* - and not the call in 
qla2xxx_eh_abort() -
+* ends the SCSI command (with result 
'res').
+*/
+   sp_get(sp);
+   
spin_unlock_irqrestore(>hardware_lock, flags);
+   qla2xxx_eh_abort(GET_CMD_SP(sp));
+   spin_lock_irqsave(>hardware_lock, 
flags);
+   }
req->outstanding_cmds[cnt] = NULL;
sp->done(vha, sp, res);
}
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from lpfc_send_taskmgmt()

2016-07-25 Thread Mauricio Faria de Oliveira

On 07/22/2016 05:18 PM, Martin K. Petersen wrote:

Applied to 4.8/scsi-queue.


Can this be routed into stable?

The 'Fixes:' commit was introduced in v4.2, and there's longterm v4.4
and stables v4.5 and v4.6 after that.

Given an oops (which may panic depending on configuration) got fixed,
and it happens in normal scenarios (eg SCSI EH), it seems appropriate.

Thanks,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] lpfc: support for CPU phys_id and core_id on PowerPC64

2016-06-22 Thread Mauricio Faria de Oliveira

Hi Martin,

On 06/21/2016 11:16 PM, Martin K. Petersen wrote:

Distros are free to carry a patch such as yours. That puts the burden on
them and not on upstream which is going in a different direction as
outlined by Christoph.



This is ultimately Broadcom's decision. It is their driver.


Right, I understand.  I submitted the patch in case they see value in
having it upstream, as some distros discuss/ask about what's the status
(or differences to) upstream.

In some cases, a rationale like this one being documented on a mailing
list is sufficient, provided the patch hasn't received other technical
problems, for example.

Thanks for the review/comments (Christoph too),

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] lpfc: support for CPU phys_id and core_id on PowerPC64

2016-06-21 Thread Mauricio Faria de Oliveira

On 06/21/2016 11:29 AM, Christoph Hellwig wrote:

I really don't think this sort of low level information has business
in a low level driver.  I've been trying to put some infrastructure
together to move this to the core kernel [1], and it would be good to
help use this in more drivers.  Especially given that lpfc may use
blk-mq which will not respect this current assignment for the queue
mapping.


Agree.

However, this patchset targets the non blk-mq/scsi-mq usage of the
lpfc driver, which falls back to CPU-number/queue-number mapping.

This turns out to still be the case w/ some distros where ppc64/le
usually runs, on which it would be easier to adapt this relatively
small change than moving forward w/ blk-mq/scsi-mq, for example --
even if the latter is clearly a superior approach.


[1] http://lists.infradead.org/pipermail/linux-nvme/2016-June/005012.html



--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] lpfc: PowerPC64 topology information, and per-core scheduling

2016-06-21 Thread Mauricio Faria de Oliveira

On 06/01/2016 05:43 PM, Mauricio Faria de Oliveira wrote:

Tested on next-20160601 (with an extra commit for patch 1/2, see commit msg).


FYI, that commit has been accepted into powerpc next [1].

[1] https://git.kernel.org/powerpc/c/f8ab481066e7246e4b272233aa

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: block: don't check request size in blk_cloned_rq_check_limits()

2016-06-16 Thread Mauricio Faria de Oliveira

Hey,

On 06/15/2016 01:34 PM, Brian King wrote:

Mauricio was looking at this, adding him to cc. We did have a KVM config
where we could reproduce this issue as well, I think with some PCI passthrough
adapters. Mauricio - do you have any more details about the KVM config that
reproduced this issue and did you ever try to reproduce this with an upstream
kernel?


It's KVM guest w/ SLES 12 SP1 + kernel updates (3.12.53-60.30 introduced
the error) and PCI passthrough of Emulex FC and FCoE adapters, connected
to 4 storage systems (DS8000, XIV, SVC, and FlashSystem 840).

It seems only the XIV LUNs never hit the problem, but it didn't seem to
be storage-specific, as FS840 LUNs had a mix of hit/not-hit the problem.
One thing is that all paths of a LUN were either failed or not - no mix
within a LUN.

Unfortunately not too much analysis was performed on this system at the
time -- Hannes had already made good progress w/ the customer, and some
test kernel builds that resolved the issue were made available soon.

Now that the topic is under discussion, I've asked for some time slots
on that system, so we can test an upstream kernel and try to reproduce
the problem and analyze it more closely.


--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from lpfc_send_taskmgmt()

2016-06-07 Thread Mauricio Faria de Oliveira
 (x0) to 
839909376 (x3210)
lpfc: skip scsi_done()
<...>
lpfc 0006:01:00.4: 4:(0):0713 SCSI layer issued Device Reset (0, 0) return 
x2002
<...>
lpfc 0006:01:00.4: 4:(0):0723 SCSI layer issued Target Reset (1, 0) return 
x2002
<...>
lpfc 0006:01:00.4: 4:(0):0714 SCSI layer issued Bus Reset Data: x2002
<...>
lpfc 0006:01:00.4: 4:(0):3172 SCSI layer issued Host Reset Data:
<...>

Fixes: 8b0dff14164d ("lpfc: Add support for using block multi-queue")
Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/lpfc/lpfc_scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 3bd0be6..c7e5695 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -3874,7 +3874,7 @@ int lpfc_sli4_scmd_to_wqidx_distr(struct lpfc_hba *phba,
uint32_t tag;
uint16_t hwq;
 
-   if (shost_use_blk_mq(cmnd->device->host)) {
+   if (cmnd && shost_use_blk_mq(cmnd->device->host)) {
tag = blk_mq_unique_tag(cmnd->request);
hwq = blk_mq_unique_tag_to_hwq(tag);
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] lpfc: support for CPU phys_id and core_id on PowerPC64

2016-06-01 Thread Mauricio Faria de Oliveira
This commit uses the macros 'topology_physical_package_id' and
'topology_core_id' from , which are defined in
arch/powerpc/include/asm/topology.h for CONFIG_PPC64 && CONFIG_SMP.

Also, change the initial value for min_phys_id from 0xff to INT_MAX
(the numbers may increment with large steps on some systems).

While in there, include the CPU number in the debug message, which
helps reading it on systems with many CPUs.

This depends on commit 'powerpc: export cpu_to_core_id()' (submitted
to the linuxppc-dev mailing list). Tested on next-20160601 w/ commit.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/lpfc/lpfc_init.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index b43f7ac..03d1946 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -34,6 +34,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -8718,7 +8720,7 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors)
phba->sli4_hba.num_present_cpu));
 
max_phys_id = 0;
-   min_phys_id = 0xff;
+   min_phys_id = INT_MAX;
phys_id = 0;
num_io_channel = 0;
first_cpu = LPFC_VECTOR_MAP_EMPTY;
@@ -8730,6 +8732,9 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors)
cpuinfo = _data(cpu);
cpup->phys_id = cpuinfo->phys_proc_id;
cpup->core_id = cpuinfo->cpu_core_id;
+#elif defined(CONFIG_PPC64) && defined(CONFIG_SMP)
+   cpup->phys_id = topology_physical_package_id(cpu);
+   cpup->core_id = topology_core_id(cpu);
 #else
/* No distinction between CPUs for other platforms */
cpup->phys_id = 0;
@@ -8737,8 +8742,8 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors)
 #endif
 
lpfc_printf_log(phba, KERN_INFO, LOG_INIT,
-   "3328 CPU physid %d coreid %d\n",
-   cpup->phys_id, cpup->core_id);
+   "3328 CPU %d physid %d coreid %d\n",
+   cpu, cpup->phys_id, cpup->core_id);
 
if (cpup->phys_id > max_phys_id)
max_phys_id = cpup->phys_id;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] lpfc: PowerPC64 topology information, and per-core scheduling

2016-06-01 Thread Mauricio Faria de Oliveira
This series enables the association of IO channels at the CPU-core level (i.e.,
with all CPUs/threads of core), which may increase performance depending on the
CPU affinity settings of the workload (e.g., benefit from the per-core caches);
and adds the required pieces to use it on PowerPC64 too (topology information),
which has server processors with many cores/threads and per-core caches.

Although the series include bits for PowerPC64, the per-core scheduling patch
is architecture independent.

Tested on next-20160601 (with an extra commit for patch 1/2, see commit msg).

Mauricio Faria de Oliveira (2):
  lpfc: support for CPU phys_id and core_id on PowerPC64
  lpfc: add option for lpfc_fcp_io_sched (LPFC_FCP_SCHED_BY_CPU_CORE)

 drivers/scsi/lpfc/lpfc_attr.c |  8 --
 drivers/scsi/lpfc/lpfc_hw4.h  |  1 +
 drivers/scsi/lpfc/lpfc_init.c | 65 ---
 drivers/scsi/lpfc/lpfc_scsi.c |  3 +-
 4 files changed, 70 insertions(+), 7 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] lpfc: add option for lpfc_fcp_io_sched (LPFC_FCP_SCHED_BY_CPU_CORE)

2016-06-01 Thread Mauricio Faria de Oliveira
This commit introduces the new value 2 for parameter 'lpfc_fcp_io_sched',
which associates work queues (or IO channels) by core, that is, all CPUs
(or threads) of a particular core are assigned the same IO channel.

The IO channels are assigned to each core in a round-robin fashion.

Tested on next-20160601.

Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>
---
 drivers/scsi/lpfc/lpfc_attr.c |  8 +--
 drivers/scsi/lpfc/lpfc_hw4.h  |  1 +
 drivers/scsi/lpfc/lpfc_init.c | 54 ++-
 drivers/scsi/lpfc/lpfc_scsi.c |  3 ++-
 4 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c
index cfec2ec..f8db2ad 100644
--- a/drivers/scsi/lpfc/lpfc_attr.c
+++ b/drivers/scsi/lpfc/lpfc_attr.c
@@ -4517,12 +4517,16 @@ LPFC_ATTR_R(ack0, 0, 0, 1, "Enable ACK0 support");
 # For [0], FCP commands are issued to Work Queues ina round robin fashion.
 # For [1], FCP commands are issued to a Work Queue associated with the
 #  current CPU.
+# For [2], FCP commands are issued to a Work Queue associated with the
+#  current CPU core (same Work Queue for all CPUs of a core).
 # It would be set to 1 by the driver if it's able to set up cpu affinity
 # for FCP I/Os through Work Queue associated with the current CPU. Otherwise,
 # roundrobin scheduling of FCP I/Os through WQs will be used.
+# It would remain set to 2 by the driver, likewise, if 2 was requested.
 */
-LPFC_ATTR_RW(fcp_io_sched, 0, 0, 1, "Determine scheduling algorithm for "
-   "issuing commands [0] - Round Robin, [1] - Current CPU");
+LPFC_ATTR_RW(fcp_io_sched, 0, 0, 2, "Determine scheduling algorithm for "
+   "issuing commands [0] - Round Robin, [1] - Current CPU, "
+   "[2] - Current CPU core");
 
 /*
 # lpfc_fcp2_no_tgt_reset: Determine bus reset behavior
diff --git a/drivers/scsi/lpfc/lpfc_hw4.h b/drivers/scsi/lpfc/lpfc_hw4.h
index 0c7070b..3c63063 100644
--- a/drivers/scsi/lpfc/lpfc_hw4.h
+++ b/drivers/scsi/lpfc/lpfc_hw4.h
@@ -191,6 +191,7 @@ struct lpfc_sli_intf {
 /* Algrithmns for scheduling FCP commands to WQs */
 #defineLPFC_FCP_SCHED_ROUND_ROBIN  0
 #defineLPFC_FCP_SCHED_BY_CPU   1
+#defineLPFC_FCP_SCHED_BY_CPU_CORE  2
 
 /* Delay Multiplier constant */
 #define LPFC_DMULT_CONST   651042
diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 03d1946..917351c 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -8703,6 +8703,7 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors)
 {
int i, idx, saved_chann, used_chann, cpu, phys_id;
int max_phys_id, min_phys_id;
+   int max_core_id, min_core_id, core_id, prev_core_id;
int num_io_channel, first_cpu, chan;
struct lpfc_vector_map_info *cpup;
 #ifdef CONFIG_X86
@@ -8722,6 +8723,9 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors)
max_phys_id = 0;
min_phys_id = INT_MAX;
phys_id = 0;
+   max_core_id = 0;
+   min_core_id = INT_MAX;
+   core_id = 0;
num_io_channel = 0;
first_cpu = LPFC_VECTOR_MAP_EMPTY;
 
@@ -8749,6 +8753,12 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int 
vectors)
max_phys_id = cpup->phys_id;
if (cpup->phys_id < min_phys_id)
min_phys_id = cpup->phys_id;
+
+   if (cpup->core_id > max_core_id)
+   max_core_id = cpup->core_id;
+   if (cpup->core_id < min_core_id)
+   min_core_id = cpup->core_id;
+
cpup++;
}
 
@@ -8820,6 +8830,46 @@ found:
}
 
/*
+* With lpfc_fcp_io_sched per core, associate IO channels with CPUs
+* based only on the core numbers instead of individual CPU numbers,
+* and ignore/overwrite already assigned values (from MSI-x vectors).
+*/
+   if (phba->cfg_fcp_io_sched == LPFC_FCP_SCHED_BY_CPU_CORE) {
+
+   /* The IO channel used by a core */
+   chan = -1;
+
+   /* For each core, assign its (sequential) CPUs the same IO 
channel */
+   prev_core_id = -1;
+   for (core_id = min_core_id; core_id <= max_core_id; core_id++) {
+
+   cpup = phba->sli4_hba.cpu_map;
+   for (cpu = 0; cpu < phba->sli4_hba.num_present_cpu; 
cpu++, cpup++) {
+
+   if (cpup->core_id != core_id)
+   continue;
+
+   /* Round-robin on different cores */
+   if (core_id != prev_core_id) {
+   prev_core_id = core_id;
+   chan = (chan + 1)

Re: [dm-devel] IBM request to allow unprivledged ioctls [Was: Revert "dm mpath: fix stalls when handling invalid ioctls"]

2015-10-29 Thread Mauricio Faria de Oliveira

Hi Mike,

On 10/29/2015 11:18 AM, Mike Snitzer wrote:

Sorry but I fail to see why your request to revert is viable.


No problem. Thanks for moving this for a discussion on a proper fix.

I'm somewhat new to kernel and SCSI workings and its community process,
so that's certainly appreciated.


Wouldn't the correct fix be to train scsi_verify_blk_ioctl() to work
even without SYS_CAP_RAWIO?


I see it would fix the problem as well, but I don't happen to know
all of its usages, so I'd have to defer to question to someone who
knows more of it.


I'm not doubting that the commit caused problems for the case you care
about (unprivledged users issueing ioctls) but that is an entirely
different issue that needs to be discussed more directly (with the
broader linux-scsi community, now cc'd).


Sure. Thanks for opening the discussion.


p.s. leaving full context for linux-scsi so they don't need to track
down other mails (btw, thanks for the detailed patch header but it
enabled me to be skeptical of your request to revert):


You're welcome. If it's been useful for rejecting this patch and
getting a better one later, it's worth it. :)

Kind regards,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [RESEND] qla2xxx: prevent board_disable from running during EEH

2015-06-26 Thread Mauricio Faria de Oliveira

Commit f3ddac1918fe963bcbf8d407a3a3c0881b47248b ([SCSI] qla2xxx:
Disable adapter when we encounter a PCI disconnect.) has introduced a
code that disables the board, releasing some resources, when reading
0x.

In case this happens when there is an EEH, this read will trigger EEH
detection and set PCI channel offline. EEH will be able to recover the
card from this state by doing a reset, so it's a better option than
simply disabling the card.

Since eeh_check_failure will mark the channel as offline before
returning the read value, in case there really was an EEH, we can simply
check for pci_channel_offline, preventing the board_disable code from
running if it's true.

Without this patch, EEH code will try to access those same resources
that board_disable will try to free. This race can cause EEH recovery to
fail.

[  504.370577] EEH: Notify device driver to resume
[  504.370580] qla2xxx [0001:07:00.0]-9002:2: The device failed to 
resume I/O from slot/link_reset.


Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
Cc: casca...@debian.org
Cc: mauri...@linux.vnet.ibm.com
---
 drivers/scsi/qla2xxx/qla_isr.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index a04a1b1..8132926 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -116,7 +116,7 @@ bool
 qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
 {
/* Check for PCI disconnection */
-   if (reg == 0x) {
+   if (reg == 0x  !pci_channel_offline(vha-hw-pdev)) {
if (!test_and_set_bit(PFLG_DISCONNECTED, vha-pci_flags) 
!test_bit(PFLG_DRIVER_REMOVING, vha-pci_flags) 
!test_bit(PFLG_DRIVER_PROBING, vha-pci_flags)) {
--
1.7.10.4



--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html