Re: [PATCH v3 0/2] scsi: mpt3sas: prevent oops in the shutdown/unload path
Martin, James, On 02/22/2018 01:07 AM, Martin K. Petersen wrote: The first patch prevents the SCSI error handlers to run once the shutdown/unload path starts. This avoids an oops at least in the host reset handler, on kernels with a recent patch, and also in the abort handler on kernels without that patch. The second patch goes a step further, and prevents the SCSI error handlers to be called for SCSI IO commands running at the shutdown/unload time (in case adapter interrupts are disabled before they complete) by finishing them. Notice the second patch only affects SCSI IO commands. Other command types, e.g., Task Manangement Functions, which are running at the shutdown/unload time still must be handled separately -- this is also covered in patch 1. Applied to 4.16/scsi-fixes. Thank you! Thank you. Are you OK with / would ack this patchset for stable (v4.14+) ? I have the backports ready and can submit if you are OK with it. Thanks, Mauricio
[PATCH v2 1/2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
This patch adds checks for 'ioc->remove_host' in the SCSI error handlers, so not to access pointers/resources potentially freed in the PCI shutdown/module unload path. The error handlers may be invoked after shutdown/unload, depending on other components. This problem was observed with kexec on a system with a mpt3sas based adapter and an infiniband adapter which takes long enough to shutdown: The mpt3sas driver finished shutting down / disabled interrupt handling, thus some commands have not finished and timed out. Since the system was still running (waiting for the infiniband adapter to shutdown), the scsi error handler for task abort of mpt3sas was invoked, and hit an oops -- either in scsih_abort() because 'ioc->scsi_lookup' was NULL (without commit dbec4c90 "scsi: mpt3sas: lockless command submission"), or later up in scsih_host_reset() (with or without that commit), because it eventually called mpt3sas_base_get_iocstate(). After commit dbec4c90, the oops in scsih_abort() does not occur anymore (_scsih_scsi_lookup_find_by_scmd() is no longer called), but that commit is too big and out of the scope of linux-stable, where this patch might help, so still go for the changes. Also, this might help to prevent similar errors in the future, in case code changes and possibly tries to access freed stuff. Note the fix in scsih_host_reset() is still important anyway. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/mpt3sas/mpt3sas_scsih.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 74fca18..5ab3caf 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -2835,7 +2835,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, _scsih_tm_display_info(ioc, scmd); sas_device_priv_data = scmd->device->hostdata; - if (!sas_device_priv_data || !sas_device_priv_data->sas_target) { + if (!sas_device_priv_data || !sas_device_priv_data->sas_target || + ioc->remove_host) { sdev_printk(KERN_INFO, scmd->device, "device been deleted! scmd(%p)\n", scmd); scmd->result = DID_NO_CONNECT << 16; @@ -2898,7 +2899,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, _scsih_tm_display_info(ioc, scmd); sas_device_priv_data = scmd->device->hostdata; - if (!sas_device_priv_data || !sas_device_priv_data->sas_target) { + if (!sas_device_priv_data || !sas_device_priv_data->sas_target || + ioc->remove_host) { sdev_printk(KERN_INFO, scmd->device, "device been deleted! scmd(%p)\n", scmd); scmd->result = DID_NO_CONNECT << 16; @@ -2961,7 +2963,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, _scsih_tm_display_info(ioc, scmd); sas_device_priv_data = scmd->device->hostdata; - if (!sas_device_priv_data || !sas_device_priv_data->sas_target) { + if (!sas_device_priv_data || !sas_device_priv_data->sas_target || + ioc->remove_host) { starget_printk(KERN_INFO, starget, "target been deleted! scmd(%p)\n", scmd); scmd->result = DID_NO_CONNECT << 16; @@ -3019,7 +3022,7 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, ioc->name, scmd); scsi_print_command(scmd); - if (ioc->is_driver_loading) { + if (ioc->is_driver_loading || ioc->remove_host) { pr_info(MPT3SAS_FMT "Blocking the host reset\n", ioc->name); r = FAILED; -- 1.8.3.1
[PATCH v3 0/2] scsi: mpt3sas: prevent oops in the shutdown/unload path
The first patch prevents the SCSI error handlers to run once the shutdown/unload path starts. This avoids an oops at least in the host reset handler, on kernels with a recent patch, and also in the abort handler on kernels without that patch. The second patch goes a step further, and prevents the SCSI error handlers to be called for SCSI IO commands running at the shutdown/unload time (in case adapter interrupts are disabled before they complete) by finishing them. Notice the second patch only affects SCSI IO commands. Other command types, e.g., Task Manangement Functions, which are running at the shutdown/unload time still must be handled separately -- this is also covered in patch 1. The patchset is v3 as PATCH 1/2 is identical to its v2 submission, and PATCH 2/2 has been added (in its first version). Thanks to Sreekanth Reddy and Bart Van Assche for review and contributions. Mauricio Faria de Oliveira (2): scsi: mpt3sas: fix oops in error handlers after shutdown/unload scsi: mpt3sas: wait for and flush running commands on shutdown/unload drivers/scsi/mpt3sas/mpt3sas_base.c | 8 drivers/scsi/mpt3sas/mpt3sas_base.h | 3 +++ drivers/scsi/mpt3sas/mpt3sas_scsih.c | 21 - 3 files changed, 23 insertions(+), 9 deletions(-) -- 1.8.3.1
[PATCH 2/2] scsi: mpt3sas: wait for and flush running commands on shutdown/unload
From: Sreekanth Reddy <sreekanth.re...@broadcom.com> This patch finishes all outstanding SCSI IO commands (but not other commands, e.g., task management) in the shutdown and unload paths. It first waits for the commands to complete (this is done after setting 'ioc->remove_host = 1 ', which prevents new commands to be queued) then it flushes commands that might still be running. This avoids triggering error handling (e.g., abort command) for all commands possibly completed by the adapter after interrupts disabled. Suggested-by: Sreekanth Reddy <sreekanth.re...@broadcom.com> Tested-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Signed-off-by: Sreekanth Reddy <sreekanth.re...@broadcom.com> [mauricfo: introduced something in commit message.] Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/mpt3sas/mpt3sas_base.c | 8 drivers/scsi/mpt3sas/mpt3sas_base.h | 3 +++ drivers/scsi/mpt3sas/mpt3sas_scsih.c | 10 +- 3 files changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c index 59a87ca..0aafbfd 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_base.c +++ b/drivers/scsi/mpt3sas/mpt3sas_base.c @@ -6297,14 +6297,14 @@ void mpt3sas_base_clear_st(struct MPT3SAS_ADAPTER *ioc, } /** - * _wait_for_commands_to_complete - reset controller + * mpt3sas_wait_for_commands_to_complete - reset controller * @ioc: Pointer to MPT_ADAPTER structure * * This function is waiting 10s for all pending commands to complete * prior to putting controller in reset. */ -static void -_wait_for_commands_to_complete(struct MPT3SAS_ADAPTER *ioc) +void +mpt3sas_wait_for_commands_to_complete(struct MPT3SAS_ADAPTER *ioc) { u32 ioc_state; @@ -6377,7 +6377,7 @@ void mpt3sas_base_clear_st(struct MPT3SAS_ADAPTER *ioc, is_fault = 1; } _base_reset_handler(ioc, MPT3_IOC_PRE_RESET); - _wait_for_commands_to_complete(ioc); + mpt3sas_wait_for_commands_to_complete(ioc); _base_mask_interrupts(ioc); r = _base_make_ioc_ready(ioc, type); if (r) diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h b/drivers/scsi/mpt3sas/mpt3sas_base.h index 789bc42..99ccf83 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_base.h +++ b/drivers/scsi/mpt3sas/mpt3sas_base.h @@ -1433,6 +1433,9 @@ void mpt3sas_base_update_missing_delay(struct MPT3SAS_ADAPTER *ioc, int mpt3sas_port_enable(struct MPT3SAS_ADAPTER *ioc); +void +mpt3sas_wait_for_commands_to_complete(struct MPT3SAS_ADAPTER *ioc); + /* scsih shared API */ struct scsi_cmnd *mpt3sas_scsih_scsi_lookup_get(struct MPT3SAS_ADAPTER *ioc, diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 5ab3caf..c2ea13c7 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -4456,7 +4456,7 @@ static int _scsih_set_satl_pending(struct scsi_cmnd *scmd, bool pending) st = scsi_cmd_priv(scmd); mpt3sas_base_clear_st(ioc, st); scsi_dma_unmap(scmd); - if (ioc->pci_error_recovery) + if (ioc->pci_error_recovery || ioc->remove_host) scmd->result = DID_NO_CONNECT << 16; else scmd->result = DID_RESET << 16; @@ -9742,6 +9742,10 @@ static void scsih_remove(struct pci_dev *pdev) unsigned long flags; ioc->remove_host = 1; + + mpt3sas_wait_for_commands_to_complete(ioc); + _scsih_flush_running_cmds(ioc); + _scsih_fw_event_cleanup_queue(ioc); spin_lock_irqsave(>fw_event_lock, flags); @@ -9818,6 +9822,10 @@ static void scsih_remove(struct pci_dev *pdev) unsigned long flags; ioc->remove_host = 1; + + mpt3sas_wait_for_commands_to_complete(ioc); + _scsih_flush_running_cmds(ioc); + _scsih_fw_event_cleanup_queue(ioc); spin_lock_irqsave(>fw_event_lock, flags); -- 1.8.3.1
Re: [PATCH v2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
Hi Sreekanth, On 02/15/2018 03:48 AM, Sreekanth Reddy wrote: During the shutdown time, I don't want the outstanding IOs to timeout due to disabling of interrupts and go the TM path. So I wanted to clear out all the Outstanding IOs in the shutdown path itself instead of clearing them in TM path. Ah. I see. I didn't realize that. Good catch. But if there are already some TMs are outstanding while initiating the shutdown operation then your patch looks good to me. Ok. Thanks for the review. May be can you include my set of changes along with your solution? Sure. I'll send a v3 with both parts and due credit. cheers, Mauricio
Re: [PATCH v2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
Hi Sreenkanth, Thanks for reviewing. On 02/12/2018 04:28 AM, Sreekanth Reddy wrote: Instead of returning the scmd with DID_NO_CONNECT in TM path, can we wait for some time (10 seconds) in shutdown/unload path for the outstanding commands to complete and even then [if] the scmds are outstanding then return all the outstanding scmds with DID_NO_CONNECT in the shutdown/unload path itself as shown below, Apparently there is a problem with that. That is not enough for TM commands; it's only for SCSI IO commands. The TM commands use the high-priority queue, and SCSI IO commands use the SCSI IO queue. But this patch only waits for and return commands in the latter: > + mpt3sas_wait_for_commands_to_complete(ioc); > + _scsih_flush_running_cmds(ioc); They only loop up until 'ioc->scsiio_depth', but TM commands get SMIDs above that [see mpt3sas_base_free_smid() and mpt3sas_scsih_issue_tm()]. So, there's an exposure for abort/reset/other TM commands. For example, the SCSI IO commands finish quickly in wait_for_commands_to_complete(), then the shutdown function continues, disables interrupts, and release pointers/memory. Then, an outstanding abort/reset command is finished, its completion interrupt is not serviced, and the SCSI EH for it kicks in/continues, and so it escalates up, and hit that oops in host reset. Is there any problem with the patch that I proposed? It seems okay to have another check in error path (not hot/performance path), and that check is already used in many other points in the code, for the same reasons (exit early before the code attempts to use stuff that might be released). Thanks again, -- Mauricio Faria de Oliveira IBM Linux Technology Center
test results
The test-case results with PATCH v2. scsih_abort() = Without patch: [ 362.669743] setting logging_level(0x1000) [ 362.705074] mpt3sas_cm0: skip free_smid/scsi_done scmd(c01fd4f2bd40) [ 363.956579] sd 16:0:1:0: [sdf] Synchronizing SCSI cache [ 363.956844] sd 16:0:0:0: [sde] Synchronizing SCSI cache [ 363.957147] mpt3sas_cm0: sending diag reset !! [ 365.073568] mpt3sas_cm0: diag reset: SUCCESS [ 365.090316] mpt3sas_cm0: sleep on shutdown [ 366.120209] mpt3sas_cm0: sleep on shutdown ... [ 373.419954] sd 16:0:1:0: attempting task abort! scmd(c01fd4f2bd40) ... [ 373.420256] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress! [ 373.420300] sd 16:0:1:0: task abort: FAILED scmd(c01fd4f2bd40) [ 373.459961] sd 16:0:1:0: attempting device reset! scmd(c01fd4f2bd40) ... [ 373.460271] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress! [ 373.460315] sd 16:0:1:0: device reset: FAILED scmd(c01fd4f2bd40) [ 373.460362] scsi target16:0:1: attempting target reset! scmd(c01fd4f2bd40) ... [ 373.460628] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress! [ 373.460673] scsi target16:0:1: target reset: FAILED scmd(c01fd4f2bd40) [ 373.460720] mpt3sas_cm0: attempting host reset! scmd(c01fd4f2bd40) [ 373.460764] sd 16:0:1:0: [sdf] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00 [ 373.460821] Unable to handle kernel paging request for data at address 0xd0003800817d [ 373.460876] Faulting instruction address: 0xd00017b169b8 [ 373.460921] Oops: Kernel access of bad area, sig: 11 [#1] ... [ 373.462028] NIP [d00017b169b8] mpt3sas_base_get_iocstate+0x38/0xb0 [mpt3sas] [ 373.462085] LR [d00017b1a3f0] mpt3sas_base_hard_reset_handler+0x1a0/0x6d0 [mpt3sas] [ 373.462138] Call Trace: [ 373.462160] [c01fdf5f7ab0] [d00017b1a3f0] mpt3sas_base_hard_reset_handler+0x1a0/0x6d0 [mpt3sas] [ 373.462225] [c01fdf5f7b80] [d00017b20f6c] scsih_host_reset+0x7c/0x100 [mpt3sas] [ 373.462281] [c01fdf5f7c00] [c076f34c] scsi_try_host_reset+0x5c/0x140 [ 373.462335] [c01fdf5f7c40] [c077107c] scsi_eh_ready_devs+0x73c/0x960 [ 373.462390] [c01fdf5f7d10] [c0772800] scsi_error_handler+0x4c0/0x4e0 [ 373.462444] [c01fdf5f7dc0] [c0107ed4] kthread+0x164/0x1b0 [ 373.462490] [c01fdf5f7e30] [c000b5a0] ret_from_kernel_thread+0x5c/0xbc [ 373.468473] Instruction dump: With patch: [ 332.994654] setting logging_level(0x1000) [ 333.024246] mpt3sas_cm0: skip free_smid/scsi_done scmd(c03c97348140) [ 334.232041] sd 16:0:1:0: [sdf] Synchronizing SCSI cache [ 334.232324] sd 16:0:0:0: [sde] Synchronizing SCSI cache [ 334.232598] mpt3sas_cm0: sending diag reset !! [ 335.352533] mpt3sas_cm0: diag reset: SUCCESS [ 335.372075] mpt3sas_cm0: sleep on shutdown [ 336.440544] mpt3sas_cm0: sleep on shutdown ... [ 343.100179] sd 16:0:1:0: attempting task abort! scmd(c03c97348140) ... [ 343.100488] sd 16:0:1:0: device been deleted! scmd(c03c97348140) [ 343.100532] sd 16:0:1:0: task abort: SUCCESS scmd(c03c97348140) [ 343.140185] sd 16:0:1:0: [sdf] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK [ 343.140275] sd 16:0:1:0: [sdf] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00 [ 343.140338] print_req_error: I/O error, dev sdf, sector 0 scsih_dev/target/host_reset() === Without patch: [ 101.077946] setting logging_level(0x1100) [ 101.110029] mpt3sas_cm0: skip free_smid/scsi_done scmd(c01fb7a2dd40) [ 102.356108] sd 16:0:1:0: [sdf] Synchronizing SCSI cache [ 102.356396] sd 16:0:0:0: [sde] Synchronizing SCSI cache [ 102.356647] mpt3sas_cm0: sending diag reset !! [ 103.476380] mpt3sas_cm0: diag reset: SUCCESS [ 103.492696] mpt3sas_cm0: sleep on shutdown [ 104.512914] mpt3sas_cm0: sleep on shutdown ... [ 111.732912] sd 16:0:1:0: attempting task abort! scmd(c01fb7a2dd40) ... [ 111.733202] mpt3sas_cm0: fail task abort scmd(c01fb7a2dd40) [ 111.733246] sd 16:0:1:0: task abort: FAILED scmd(c01fb7a2dd40) [ 111.772919] sd 16:0:1:0: attempting device reset! scmd(c01fb7a2dd40) ... [ 111.773177] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress! [ 111.773222] sd 16:0:1:0: device reset: FAILED scmd(c01fb7a2dd40) [ 111.773266] scsi target16:0:1: attempting target reset! scmd(c01fb7a2dd40) ... [ 111.773528] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress! [ 111.773574] scsi target16:0:1: target reset: FAILED scmd(c01fb7a2dd40) [ 111.773617] mpt3sas_cm0: attempting host reset! scmd(c01fb7a2dd40) [ 111.773662] sd 16:0:1:0: [sdf] tag#0 CDB:
[PATCH v2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
This patch adds checks for 'ioc->remove_host' in the SCSI error handlers, so not to access pointers/resources potentially freed in the PCI shutdown/module unload path. The error handlers may be invoked after shutdown/unload, depending on other components. This problem was observed with kexec on a system with a mpt3sas based adapter and an infiniband adapter which takes long enough to shutdown: The mpt3sas driver finished shutting down / disabled interrupt handling, thus some commands have not finished and timed out. Since the system was still running (waiting for the infiniband adapter to shutdown), the scsi error handler for task abort of mpt3sas was invoked, and hit an oops -- either in scsih_abort() because 'ioc->scsi_lookup' was NULL (without commit dbec4c9040ed ("scsi: mpt3sas: lockless command submission")), or later up in scsih_host_reset() (with or without that commit), because it eventually called mpt3sas_base_get_iocstate(). After that commit, the oops in scsih_abort() does not occur anymore (_scsih_scsi_lookup_find_by_scmd() is no longer called), but that commit is too big and out of the scope of linux-stable, where this patch might help, so still go for the changes. Also, this might help to prevent similar errors in the future, in case code changes and possibly tries to access freed stuff. Note the fix in scsih_host_reset() is still important anyway. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- v2: - rebase on top of mkp/scsi.git's fixes branch - insert check for 'ioc->remove_host' in existing checks which already set DID_NO_CONNECT instead of duplicating that code. (helps with backports) - update commit message about that commit. drivers/scsi/mpt3sas/mpt3sas_scsih.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 74fca18..5ab3caf 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -2835,7 +2835,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, _scsih_tm_display_info(ioc, scmd); sas_device_priv_data = scmd->device->hostdata; - if (!sas_device_priv_data || !sas_device_priv_data->sas_target) { + if (!sas_device_priv_data || !sas_device_priv_data->sas_target || + ioc->remove_host) { sdev_printk(KERN_INFO, scmd->device, "device been deleted! scmd(%p)\n", scmd); scmd->result = DID_NO_CONNECT << 16; @@ -2898,7 +2899,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, _scsih_tm_display_info(ioc, scmd); sas_device_priv_data = scmd->device->hostdata; - if (!sas_device_priv_data || !sas_device_priv_data->sas_target) { + if (!sas_device_priv_data || !sas_device_priv_data->sas_target || + ioc->remove_host) { sdev_printk(KERN_INFO, scmd->device, "device been deleted! scmd(%p)\n", scmd); scmd->result = DID_NO_CONNECT << 16; @@ -2961,7 +2963,8 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, _scsih_tm_display_info(ioc, scmd); sas_device_priv_data = scmd->device->hostdata; - if (!sas_device_priv_data || !sas_device_priv_data->sas_target) { + if (!sas_device_priv_data || !sas_device_priv_data->sas_target || + ioc->remove_host) { starget_printk(KERN_INFO, starget, "target been deleted! scmd(%p)\n", scmd); scmd->result = DID_NO_CONNECT << 16; @@ -3019,7 +3022,7 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, ioc->name, scmd); scsi_print_command(scmd); - if (ioc->is_driver_loading) { + if (ioc->is_driver_loading || ioc->remove_host) { pr_info(MPT3SAS_FMT "Blocking the host reset\n", ioc->name); r = FAILED; -- 1.8.3.1
Re: [PATCH] test-case
On 01/31/2018 08:50 PM, Bart Van Assche wrote: I think it would be useful to have some variant of the above code in the kernel tree. Are you familiar with the fault injection framework (see also and Documentation/fault-injection/fault-injection.txt)? No, not yet. That's very interesting. Do you think that framework would be appropriate for controlling whether or not the above code gets executed? Yes, apparently it might be done w/ fail attributes to control the wait-loop in the shutdown/unload hook, whether to call scsi_done(), and whether to fail an error handler so it escalates to the next. I'd be happy to take a look at it, if you/others are interested. (just to make sure it does not end up rejected later because of others' opinion. :-) cheers, Mauricio
Re: [PATCH] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
On 01/31/2018 08:59 PM, Bart Van Assche wrote: On Wed, 2018-01-31 at 17:48 -0200, Mauricio Faria de Oliveira wrote: On 01/31/2018 05:06 PM, Bart Van Assche wrote: Sorry but I think this patch introduces new race conditions. Have you Can you detail the race conditions? As far as I can see, the only race condition would be when an error handler is invoked very close in time to the shutdown/unload handler, and as such miss the 'ioc->remove_host' assignment (thus it continues running anyway, and hit the oops). That's indeed the race I was referring to. [snip] Okay. I'm not sure that point invalidates this patch; let me explain. First, IMHO that problem already exists in mpt3sas as it is now. There are 17 checks for 'ioc->remove_host' to return early, in the same manner. AFAIK, those are subject to this same race condition. $ grep -r 'ioc->remove_host[^=]*$' drivers/scsi/mpt3sas/ | wc -l 17 Second, I don't think this patch makes the driver any worse. Even with the potential race condition (which already exists elsewhere in the driver), it reduces the chance to hit an oops from _every time_ to only when the race condition occurs _and_ the error handler looses. So, even not being completely safe (like other parts of the driver), it still does help. BTW, are you aware that your patch does not seem to apply on Martin's tree (the fixes branch of git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git): > [snip] Can you rebase this patch on top of Martin's tree? Yes, sure. I'll rebase, test, and send v2. Thanks for the pointer. cheers, Mauricio
Re: [PATCH] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
Bart, Thanks for reviewing. On 01/31/2018 05:06 PM, Bart Van Assche wrote: Sorry but I think this patch introduces new race conditions. Have you Can you detail the race conditions? As far as I can see, the only race condition would be when an error handler is invoked very close in time to the shutdown/unload handler, and as such miss the 'ioc->remove_host' assignment (thus it continues running anyway, and hit the oops). But this problem has never been reported, so I'd think that case would be rare. Not that it does not need to be perfectly solved, but _if_ that is the only problem with the patch (and _if_ I got that right), then it would still be better than the current code in oops. > considered to modify the mpt3sas driver such that interrupts are only > disabled after all commands have finished? Yes, I considered introducing waits in several points, but since the driver is tearing down, the only point which it seem to make sense to insert is right at the beginning of shutdown/unload -- but it seemed not sufficient: imagine not all commands finish, which would block that path, so we need a time out mechanism, but that does not guarantee the call back won't occur later on (say, a while after we gave up), so we would still need a guard at the callbacks. I also considered checking for the particular pointer which is set to NULL and cause the Oops, but that is a series of 10ish (small) patches which must check 'ioc->scsi_lookup' in a spinlock, so it seemed a bit bigger than what was needed.. and would still be prone to hitting an oops due to another pointer at a later time. So after thinking about those, I went with the simplest approach. I'd be happy to consider the race conditions you thought of, provided more detail. thanks, Mauricio
[PATCH] test-case
This patch can be verified with this simple test-case, which inserts a wait loop at the bottom of 'scsih_shutdown()' and forces SCSI commands to timeout (skip 'scmd->scsi_done()'). It abuses the 'ioc->logging_level' parameter do to that, with: - 0x1000: wait loop on scsih_shutdown() and skip scsi_done() - 0x0100: force scsih_abort() to return FAILED early, so to run device/target/host reset. Oops in scsih_abort() = Test-case: # echo 0x1000 > /sys/module/mpt3sas/parameters/logging_level # dd if=/dev/sdf of=/dev/null count=1 iflag=direct & # kexec --force --append="$(cat /proc/cmdline)" --initrd=/boot/initrd.img-4.15.0 /boot/vmlinux-4.15.0 Without patch: [ 141.936251] setting logging_level(0x1000) [ 141.977920] mpt3sas_cm0: skip scsi_done scmd(a85f0166) [ 147.927561] sd 16:0:1:0: [sdf] Synchronizing SCSI cache [ 147.927831] sd 16:0:0:0: [sde] Synchronizing SCSI cache [ 147.928090] mpt3sas_cm0: sending diag reset !! [ 149.041346] mpt3sas_cm0: diag reset: SUCCESS [ 149.057985] mpt3sas_cm0: sleep on shutdown [ 150.098619] mpt3sas_cm0: sleep on shutdown [ 151.138571] mpt3sas_cm0: sleep on shutdown ... [ 171.938245] mpt3sas_cm0: sleep on shutdown [ 172.678231] sd 16:0:1:0: attempting task abort! scmd(a85f0166) ... [ 172.678545] Unable to handle kernel paging request for data at address 0x0008 [ 172.678600] Faulting instruction address: 0xd0001789e8c0 [ 172.678648] Oops: Kernel access of bad area, sig: 11 [#1] ... [ 172.679804] NIP [d0001789e8c0] scsih_abort+0xc0/0x290 [mpt3sas] [ 172.679854] LR [d0001789e8a8] scsih_abort+0xa8/0x290 [mpt3sas] [ 172.679903] Call Trace: [ 172.679926] [c01fed68fbc0] [d0001789e8a8] scsih_abort+0xa8/0x290 [mpt3sas] (unreliable) [ 172.679994] [c01fed68fc50] [c075a274] scmd_eh_abort_handler+0xc4/0x1a0 [ 172.680053] [c01fed68fc90] [c00fea88] process_one_work+0x188/0x450 [ 172.680109] [c01fed68fd20] [c00fede8] worker_thread+0x98/0x550 [ 172.680157] [c01fed68fdc0] [c0107344] kthread+0x164/0x1b0 [ 172.680206] [c01fed68fe30] [c000b6e0] ret_from_kernel_thread+0x5c/0x7c [ 172.680261] Instruction dump: ... With patch: [ 233.259952] setting logging_level(0x1000) [ 233.290008] mpt3sas_cm0: skip scsi_done scmd(7ec97dda) [ 234.600934] sd 16:0:1:0: [sdf] Synchronizing SCSI cache [ 234.601222] sd 16:0:0:0: [sde] Synchronizing SCSI cache [ 234.601470] mpt3sas_cm0: sending diag reset !! [ 235.718433] mpt3sas_cm0: diag reset: SUCCESS [ 235.734534] mpt3sas_cm0: sleep on shutdown [ 236.805704] mpt3sas_cm0: sleep on shutdown [ 237.845708] mpt3sas_cm0: sleep on shutdown ... [ 263.845781] mpt3sas_cm0: sleep on shutdown [ 264.185782] sd 16:0:1:0: attempting task abort! scmd(7ec97dda) ... [ 264.186104] sd 16:0:1:0: task abort: SUCCESS scmd(7ec97dda) [ 264.225788] sd 16:0:1:0: [sdf] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK [ 264.225910] sd 16:0:1:0: [sdf] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00 [ 264.225969] print_req_error: I/O error, dev sdf, sector 0 Oops in scsih_host_reset() == Test-case: # echo 0x1100 > /sys/module/mpt3sas/parameters/logging_level # dd if=/dev/sdf of=/dev/null count=1 iflag=direct & # kexec --force --append="$(cat /proc/cmdline)" --initrd=/boot/initrd.img-4.15.0 /boot/vmlinux-4.15.0 Without patch: [ 241.734670] setting logging_level(0x1100) [ 251.587765] mpt3sas_cm0: skip scsi_done scmd(60a524f9) [ 252.771054] sd 16:0:1:0: [sdf] Synchronizing SCSI cache [ 252.771335] sd 16:0:0:0: [sde] Synchronizing SCSI cache [ 252.771582] mpt3sas_cm0: sending diag reset !! [ 253.889275] mpt3sas_cm0: diag reset: SUCCESS [ 253.906315] mpt3sas_cm0: sleep on shutdown [ 254.966487] mpt3sas_cm0: sleep on shutdown [ 256.006417] mpt3sas_cm0: sleep on shutdown ... [ 282.005452] mpt3sas_cm0: sleep on shutdown [ 282.105416] sd 16:0:1:0: attempting task abort! scmd(60a524f9) ... [ 282.105707] mpt3sas_cm0: fail task abort scmd(60a524f9) [ 282.105754] sd 16:0:1:0: task abort: FAILED scmd(60a524f9) [ 282.105811] sd 16:0:1:0: attempting device reset! scmd(60a524f9) ... [ 282.106087] mpt3sas_scsih_issue_tm: mpt3sas_cm0: host reset in progress! [ 282.106136] sd 16:0:1:0: device reset: FAILED
[PATCH] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
This patch adds checks for 'ioc->remove_host' in the SCSI error handlers, so not to access pointers/resources potentially freed in the PCI shutdown/module unload path. The error handlers may be invoked after shutdown/unload, depending on other components. This problem was observed with kexec on a system with a mpt3sas based adapter and an infiniband adapter which takes long enough to shutdown. The mpt3sas driver finished shuttting down, which disabled interruption handling, thus some running commands have not finished and timed out. Since the system was still running, waiting for the infiniband adapter to shut down, the scsi error handler for task abort of mpt3sas was invoked, and hit the oops because 'ioc->scsi_lookup' was NULL. During patch testing, a different oops happens if host reset is reached (when it eventually calls 'mpt3sas_base_get_iocstate()'). The device reset and target reset handlers do not cause oopses, but print a misleading message of host reset in progress, thus fix those too. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/mpt3sas/mpt3sas_scsih.c | 28 1 file changed, 28 insertions(+) diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index b258f21..3c4e47c 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -3007,6 +3007,13 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, goto out; } + if (ioc->remove_host) { + scmd->result = DID_NO_CONNECT << 16; + scmd->scsi_done(scmd); + r = SUCCESS; + goto out; + } + /* search for the command */ smid = _scsih_scsi_lookup_find_by_scmd(ioc, scmd); if (!smid) { @@ -3069,6 +3076,13 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, goto out; } + if (ioc->remove_host) { + scmd->result = DID_NO_CONNECT << 16; + scmd->scsi_done(scmd); + r = SUCCESS; + goto out; + } + /* for hidden raid components obtain the volume_handle */ handle = 0; if (sas_device_priv_data->sas_target->flags & @@ -3131,6 +3145,13 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, goto out; } + if (ioc->remove_host) { + scmd->result = DID_NO_CONNECT << 16; + scmd->scsi_done(scmd); + r = SUCCESS; + goto out; + } + /* for hidden raid components obtain the volume_handle */ handle = 0; if (sas_device_priv_data->sas_target->flags & @@ -3179,6 +3200,13 @@ int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle, ioc->name, scmd); scsi_print_command(scmd); + if (ioc->remove_host) { + scmd->result = DID_NO_CONNECT << 16; + scmd->scsi_done(scmd); + r = FAILED; + goto out; + } + if (ioc->is_driver_loading) { pr_info(MPT3SAS_FMT "Blocking the host reset\n", ioc->name); -- 1.8.3.1
Re: [PATCH v2 1/4] scsi: scsi_dh_alua: allow I/O in target port unavailable and standby states
On 07/11/2017 12:32 PM, Mauricio Faria de Oliveira wrote: Also, it seems the Unavailable/Standby states would not be logged without a recheck from alua_check_sense(), since the only callers of alua_rtpg_queue() are alua_activate() and alua_check[_sense]() Well, actually it does get logged if when activating/switching path groups but shouldn't be the case with only a single path group. -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v2 1/4] scsi: scsi_dh_alua: allow I/O in target port unavailable and standby states
On 07/11/2017 06:18 AM, Hannes Reinecke wrote: NACK. The whole_point_ of having device handlers is to_avoid_ I/O errors during booting. And the ALUA checker is prepared to handle this situation properly. The directio checker of course doesn't know about this, but then no-one expected the directio checker to work with ALUA. I lacked that more holistic understanding. Thanks for explaining. Now, for the sake of logging/debugging... Any problem with patches 2 and 4? Also, it seems the Unavailable/Standby states would not be logged without a recheck from alua_check_sense(), since the only callers of alua_rtpg_queue() are alua_activate() and alua_check[_sense]() [the call from alua_check_vpd() is only in the initialization path]. Isn't there a point in scheduling a recheck once those conditions are found in alua_check_sense() to get them logged? - since valid path checkers won't go through that function. (and it occurred to me that the state-change check of patch 3 can be done there, simpler.) cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v2 0/4] scsi_dh_alua: fix stuck I/O after unavailable/standby states
On 07/10/2017 07:47 PM, Mauricio Faria de Oliveira wrote: This patchset addresses that problem, and adds a few improvements to the logging of PG state changes. Here are some kernel log snippets with the patchset, if that helps. The 2 port groups temporarily gone into unavailable state, and recovered to active states later on, when I/O has been restored. The devices, port groups, and relative port IDs: 2 LUNs, 2 targets per LUN, 2 hosts (8 paths total). [ 294.482107] scsi 4:0:0:0: alua: device naa.60050764008100dac104 port group 1 rel port 881 [ 294.642063] scsi 4:0:0:1: alua: device naa.60050764008100dac105 port group 0 rel port 881 [ 294.812140] scsi 4:0:1:0: alua: device naa.60050764008100dac104 port group 0 rel port 81 [ 295.002027] scsi 4:0:1:1: alua: device naa.60050764008100dac105 port group 1 rel port 81 [ 296.952092] scsi 5:0:0:0: alua: device naa.60050764008100dac104 port group 1 rel port 881 [ 297.162043] scsi 5:0:0:1: alua: device naa.60050764008100dac105 port group 0 rel port 881 [ 297.312130] scsi 5:0:1:0: alua: device naa.60050764008100dac104 port group 0 rel port 81 [ 297.512006] scsi 5:0:1:1: alua: device naa.60050764008100dac105 port group 1 rel port 81 First, mpatha goes down: mpatha (360050764008100dac104) dm-6 IBM ,2145 size=40G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='round-robin 0' prio=0 status=active | |- 4:0:1:0 sde 8:64 active undef unknown | `- 5:0:1:0 sdi 8:128 active undef unknown `-+- policy='round-robin 0' prio=0 status=enabled |- 4:0:0:0 sdc 8:32 active undef unknown `- 5:0:0:0 sdg 8:96 active undef unknown [258929.940734] device-mapper: multipath: Failing path 8:64. [258929.940751] device-mapper: multipath: Failing path 8:128. [258929.940777] sd 4:0:0:0: alua: alua_activate(): port group 01, fn: pg_init_done [dm_multipath]() [258929.940784] sd 5:0:0:0: alua: alua_activate(): port group 01, fn: pg_init_done [dm_multipath]() [258929.941575] sd 5:0:1:0: alua: alua_check_sense(): Sense Key: 0x2, ASC: 0x4, ASCQ: 0xc [258929.941578] sd 5:0:1:0: [sdi] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [258929.941874] sd 5:0:1:0: [sdi] tag#1 Sense Key : Not Ready [current] [258929.941928] sd 5:0:1:0: [sdi] tag#1 Add. Sense: Logical unit not accessible, target port in unavailable state [258929.942096] sd 4:0:1:0: alua: alua_check_sense(): Sense Key: 0x2, ASC: 0x4, ASCQ: 0xc [258929.942104] sd 4:0:1:0: [sde] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [258929.942300] sd 4:0:1:0: [sde] tag#2 Sense Key : Not Ready [current] [258929.942354] sd 4:0:1:0: [sde] tag#2 Add. Sense: Logical unit not accessible, target port in unavailable state [258929.956347] sd 4:0:1:0: alua: port group 00 state U non-preferred supports tolusna [258929.956389] sd 4:0:1:0: alua: port group 01 state U non-preferred [258930.036355] sd 5:0:0:0: alua: alua_check_sense(): Sense Key: 0x2, ASC: 0x4, ASCQ: 0xc [258930.036360] sd 5:0:0:0: alua: alua_rtpg_queue(): port group 01 [258930.036365] device-mapper: multipath: Failing path 8:96. [258930.036444] sd 4:0:0:0: alua: alua_check_sense(): Sense Key: 0x2, ASC: 0x4, ASCQ: 0xc [258930.036506] device-mapper: multipath: Failing path 8:32. Then, mpathb goes down: mpathb (360050764008100dac105) dm-7 IBM ,2145 size=40G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='round-robin 0' prio=0 status=active | |- 4:0:0:1 sdd 8:48 active undef unknown | `- 5:0:0:1 sdh 8:112 active undef unknown `-+- policy='round-robin 0' prio=0 status=enabled |- 4:0:1:1 sdf 8:80 active undef unknown `- 5:0:1:1 sdj 8:144 active undef unknown [258930.162558] sd 4:0:0:1: alua: alua_check_sense(): Sense Key: 0x2, ASC: 0x4, ASCQ: 0xc [258930.162568] sd 4:0:0:1: alua: alua_rtpg_queue(): port group 00 [258930.162572] sd 4:0:0:1: [sdd] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [258930.162731] sd 4:0:0:1: [sdd] tag#9 Sense Key : Not Ready [current] [258930.162762] sd 4:0:0:1: [sdd] tag#9 Add. Sense: Logical unit not accessible, target port in unavailable state [258930.162860] device-mapper: multipath: Failing path 8:48. [258930.162979] sd 5:0:0:1: [sdh] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [258930.162980] sd 5:0:0:1: [sdh] tag#1 Sense Key : Not Ready [current] [258930.162982] sd 5:0:0:1: [sdh] tag#1 Add. Sense: Logical unit not accessible, target port in unavailable state [258930.162992] device-mapper: multipath: Failing path 8:112. [258930.163025] sd 4:0:1:1: alua: alua_activate(): port group 01, fn: pg_init_done [dm_multipath]() [258930.163029] sd 4:0:1:1: alua: alua_rtpg_queue(): port group 01 [258930.163048] sd 5:0:1:1: alua: alua_activate(): port group 01, fn: pg_init_done [dm_multipath]() [258930.176323] sd 4:0:0:1: alua: port group 00 state U non-preferred supports tolusna
[PATCH v2 2/4] scsi: scsi_dh_alua: print changes to RTPG state of all PGs
Currently, alua_rtpg() can change the 'state' and 'preferred' values for the current port group _and_ of other port groups. However, it reports that _only_ for the current port group. This might cause uncertainty and confusion when going through the kernel logs for analyzing/debugging scsi_dh_alua behavior, which is not helpful during support and development scenarios. So, print of such changes for all port groups, when it occurs. It's possible to distinguish between the messages printed for the current and other PGs by the 'supports' (supported states) part, which is only printed for the current PG. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- v2: - use lockdep_assert_held() instead of documenting locking conventions (Bart Van Assche <bart.vanass...@sandisk.com>) - define two functions (with/without supported states information) (Bart Van Assche <bart.vanass...@sandisk.com>) - simplify which device is used for printing the messages (use the evaluated scsi device and tell which PG the info is for) - call the print functions from nearby places (right after modifying the PG data) drivers/scsi/device_handler/scsi_dh_alua.c | 63 +++--- 1 file changed, 49 insertions(+), 14 deletions(-) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index a1cf3d6aa853..937341ddb767 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -529,6 +529,49 @@ static int alua_tur(struct scsi_device *sdev) } /* + * alua_rtpg_print - Print REPORT TARGET GROUP STATES information + * (without supported states) + * @sdev: the device evaluated (source of information). + * @pg: the port group associated with the information. + */ +static void alua_rtpg_print(struct scsi_device *sdev, + struct alua_port_group *pg) +{ + lockdep_assert_held(>lock); + + sdev_printk(KERN_INFO, sdev, + "%s: port group %02x state %c %s\n", + ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state), + pg->pref ? "preferred" : "non-preferred"); +} + +/* + * alua_rtpg_print_supported - Print REPORT TARGET GROUP STATES information + *(with supported states) + * @sdev: the device evaluated (source of information). + * @pg: the port group associated with the information. + * @supported_states: the supported states information. + */ +static void alua_rtpg_print_supported(struct scsi_device *sdev, + struct alua_port_group *pg, + int supported_states) +{ + lockdep_assert_held(>lock); + + sdev_printk(KERN_INFO, sdev, + "%s: port group %02x state %c %s supports %c%c%c%c%c%c%c\n", + ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state), + pg->pref ? "preferred" : "non-preferred", + supported_states & TPGS_SUPPORT_TRANSITION? 'T' : 't', + supported_states & TPGS_SUPPORT_OFFLINE ? 'O' : 'o', + supported_states & TPGS_SUPPORT_LBA_DEPENDENT ? 'L' : 'l', + supported_states & TPGS_SUPPORT_UNAVAILABLE ? 'U' : 'u', + supported_states & TPGS_SUPPORT_STANDBY ? 'S' : 's', + supported_states & TPGS_SUPPORT_NONOPTIMIZED ? 'N' : 'n', + supported_states & TPGS_SUPPORT_OPTIMIZED ? 'A' : 'a'); +} + +/* * alua_rtpg - Evaluate REPORT TARGET GROUP STATES * @sdev: the device to be evaluated. * @@ -540,7 +583,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) { struct scsi_sense_hdr sense_hdr; struct alua_port_group *tmp_pg; - int len, k, off, valid_states = 0, bufflen = ALUA_RTPG_SIZE; + int len, k, off, bufflen = ALUA_RTPG_SIZE; unsigned char *desc, *buff; unsigned err, retval; unsigned int tpg_desc_tbl_off; @@ -674,9 +717,12 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) h->sdev->access_state = desc[0]; } rcu_read_unlock(); + + if (tmp_pg == pg) + alua_rtpg_print_supported(sdev, tmp_pg, desc[1]); + else + alua_rtpg_print(sdev, tmp_pg); } - if (tmp_pg == pg) - valid_states = desc[1]; spin_unlock_irqrestore(_pg->lock, f
[PATCH v2 1/4] scsi: scsi_dh_alua: allow I/O in target port unavailable and standby states
According to SPC-4 (5.15.2.4.5 Unavailable state), the unavailable state may (or may not) transition to other states (e.g., microcode downloading or hardware error, which may be temporary or permanent). But, scsi_dh_alua currently fails I/O requests early on once that state occurs (in alua_prep_fn()) preventing path checkers in such function path to actually check if I/O still fails or now works. And that prevents a path activation (alua_activate()) which could update the PG state if it eventually recovered to an active state, thus resume I/O. (This is also the case with the standby state.) This might cause device-mapper multipath to fail all paths to some storage system that moves the controllers to the unavailable state for firmware upgrades, and never recover regardless of the storage system doing upgrades one controller at a time and get them online. Then I/O requests are blocked indefinitely due to queue_if_no_path but the underlying individual paths are fully operational, and can be verified as such through other function paths (e.g., SG_IO): # multipath -l mpatha (360050764008100dac100) dm-0 IBM,2145 size=40G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=0 status=enabled | |- 1:0:1:0 sdf 8:80 failed undef running | `- 2:0:1:0 sdn 8:208 failed undef running `-+- policy='service-time 0' prio=0 status=enabled |- 1:0:0:0 sdb 8:16 failed undef running `- 2:0:0:0 sdj 8:144 failed undef running # strace -e read \ sg_dd blk_sgio=0 \ if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \ 2>&1 | grep 512 read(3, 0x3fff7ba8, 512) = -1 EIO (Input/output error) # strace -e ioctl \ sg_dd blk_sgio=1 \ if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \ 2>&1 | grep 512 ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[10]=[28, 00, 00, 00, 00, 00, 00, 00, 01, 00], <...>) = 0 So, allow I/O to paths of PGs in unavailable/standby state, so path checkers can actually check them. Also, schedule a recheck when unavailable/standby state is detected (in alua_check_sense()) to update pg->state, and quiet further SCSI error messages (in alua_prep_fn()). Once a path checker eventually detects a working/active state again, the PG state is normally updated on path activation (alua_activate(), as it schedules a recheck), thus I/O requests are no longer failed. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Reported-by: Naresh Bannoth <nbann...@in.ibm.com> --- v2: - also add support for standby state to alua_check_sense(), alua_prep_fn() (Bart Van Assche <bart.vanass...@sandisk.com>) drivers/scsi/device_handler/scsi_dh_alua.c | 25 + 1 file changed, 25 insertions(+) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index c01b47e5b55a..a1cf3d6aa853 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -431,6 +431,26 @@ static int alua_check_sense(struct scsi_device *sdev, alua_check(sdev, false); return NEEDS_RETRY; } + if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0b) { + /* +* LUN Not Accessible - target port in standby state. +* +* Do not retry, so failover to another target port occur. +* Schedule a recheck to update state for other functions. +*/ + alua_check(sdev, true); + return SUCCESS; + } + if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0c) { + /* +* LUN Not Accessible - target port in unavailable state. +* +* Do not retry, so failover to another target port occur. +* Schedule a recheck to update state for other functions. +*/ + alua_check(sdev, true); + return SUCCESS; + } break; case UNIT_ATTENTION: if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00) { @@ -1057,6 +1077,8 @@ static void alua_check(struct scsi_device *sdev, bool force) * * Fail I/O to all paths not in state * active/optimized or active/non-optimized. + * Allow I/O to paths in state unavailable/standby + * so path checkers can actually check them. */ static int alua_prep_fn(struct scsi_device *sdev, struct request *req) { @@ -1072,6 +1094,9 @@ static int alua_prep_fn(struct scsi_device *sdev, struct request *req) rcu_read_unlock(); if (state ==
[PATCH v2 3/4] scsi: scsi_dh_alua: do not print RTPG state if it remains unavailable/standby
Path checkers will check paths of a port group in unavailable/standby state more frequently (as they are 'failed') - possibly for a long or indefinite period of time, and/or for a large number of paths. That might flood the kernel log with scsi_dh_alua RTPG state messages, due to the recheck scheduled in alua_check_sense() to update PG state. So, do not to print such message if unavailable/standby state remains (i.e., the PG did not transition to/from such states). All other cases continue to be printed. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- v2: - changed v1's alua_state_remains() into alua_rtpg_print_check(), which includes in the states (not) to print for (deduplicates code). - added support for Standby state too (requested in patch 1 of this set) (Bart Van Assche <bart.vanass...@sandisk.com>) - remove superfluous parentheses in return statement. (Bart Van Assche <bart.vanass...@sandisk.com>) - remove likely() hint (Bart Van Assche <bart.vanass...@sandisk.com>) drivers/scsi/device_handler/scsi_dh_alua.c | 35 ++ 1 file changed, 31 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index 937341ddb767..e0bf0827a980 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -572,6 +572,30 @@ static void alua_rtpg_print_supported(struct scsi_device *sdev, } /* + * alua_rtpg_print_check - Whether to print RTPG state information + *on particular state transitions. + * @old_state: the old state value. + * @new_state: the new state value. + */ +static bool alua_rtpg_print_check(int old_state, int new_state) +{ + /* +* Do not print RTPG state information in case +* state remains either unavailable or standby. +* +* Print it for everything else. +*/ + + switch (old_state) { + case SCSI_ACCESS_STATE_UNAVAILABLE: + case SCSI_ACCESS_STATE_STANDBY: + return old_state != new_state; + default: + return 1; + } +} + +/* * alua_rtpg - Evaluate REPORT TARGET GROUP STATES * @sdev: the device to be evaluated. * @@ -706,6 +730,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) if ((tmp_pg == pg) || !(tmp_pg->flags & ALUA_PG_RUNNING)) { struct alua_dh_data *h; + int tmp_pg_state_orig = tmp_pg->state; tmp_pg->state = desc[0] & 0x0f; tmp_pg->pref = desc[0] >> 7; @@ -718,10 +743,12 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) } rcu_read_unlock(); - if (tmp_pg == pg) - alua_rtpg_print_supported(sdev, tmp_pg, desc[1]); - else - alua_rtpg_print(sdev, tmp_pg); + if (alua_rtpg_print_check(tmp_pg_state_orig, tmp_pg->state)) { + if (tmp_pg == pg) + alua_rtpg_print_supported(sdev, tmp_pg, desc[1]); + else + alua_rtpg_print(sdev, tmp_pg); + } } spin_unlock_irqrestore(_pg->lock, flags); } -- 1.8.3.1
[PATCH v2 4/4] scsi: scsi_dh_alua: add sdev_dbg() to track alua_rtpg_work()
Insert sdev_dbg() calls in the function path which may queue alua_rtpg_work() past initialization, for debugging purposes: - alua_activate() - alua_check_sense() - alua_rtpg_queue() Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/device_handler/scsi_dh_alua.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index e0bf0827a980..5bf0d55f6eb1 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -422,6 +422,10 @@ static char print_alua_state(unsigned char state) static int alua_check_sense(struct scsi_device *sdev, struct scsi_sense_hdr *sense_hdr) { + sdev_dbg(sdev, "%s: %s(): Sense Key: 0x%x, ASC: 0x%x, ASCQ: 0x%x\n", +ALUA_DH_NAME, __func__, sense_hdr->sense_key, +sense_hdr->asc, sense_hdr->ascq); + switch (sense_hdr->sense_key) { case NOT_READY: if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) { @@ -987,10 +991,13 @@ static bool alua_rtpg_queue(struct alua_port_group *pg, spin_unlock_irqrestore(>lock, flags); if (start_queue) { + if (queue_delayed_work(alua_wq, >rtpg_work, - msecs_to_jiffies(ALUA_RTPG_DELAY_MSECS))) + msecs_to_jiffies(ALUA_RTPG_DELAY_MSECS))) { + sdev_dbg(sdev, "%s: %s(): port group %02x\n", +ALUA_DH_NAME, __func__, pg->group_id); sdev = NULL; - else + } else kref_put(>kref, release_port_group); } if (sdev) @@ -1100,6 +1107,9 @@ static int alua_activate(struct scsi_device *sdev, rcu_read_unlock(); mutex_unlock(>init_mutex); + sdev_dbg(sdev, "%s: %s(): port group %02x, fn: %pf()\n", +ALUA_DH_NAME, __func__, pg->group_id, fn); + if (alua_rtpg_queue(pg, sdev, qdata, true)) fn = NULL; else -- 1.8.3.1
[PATCH v2 0/4] scsi_dh_alua: fix stuck I/O after unavailable/standby states
Currently, scsi_dh_alua fails I/O requests early on once ALUA state unavailable/standby occur, which prevents path checkers to actually check if I/O still fails or now works. Then I/O requests are blocked indefinitely due to queue_if_no_path but the underlying individual paths are fully operational, and can be verified as such otherways (e.g., SG_IO). This patchset addresses that problem, and adds a few improvements to the logging of PG state changes. Patch 1 fixes the problem. Patch 2 makes sure that state changes for all PGs are logged. Patch 3 makes sure that state no-changes for PGs in unavailable/standby are not logged - only changes are. Patch 4 adds few sdev_dbg() calls to track the path to alua_rtpg_work() Tested on v4.12+ (commit b4b8cbf679c4). Mauricio Faria de Oliveira (4): scsi: scsi_dh_alua: allow I/O in target port unavailable and standby states scsi: scsi_dh_alua: print changes to RTPG state of all PGs scsi: scsi_dh_alua: do not print RTPG state if it remains unavailable/standby scsi: scsi_dh_alua: add sdev_dbg() to track alua_rtpg_work() drivers/scsi/device_handler/scsi_dh_alua.c | 129 + 1 file changed, 113 insertions(+), 16 deletions(-) -- 1.8.3.1
Re: [PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
This is the PATCH v2. Sorry for the wrong subject line. On 04/11/2017 11:46 AM, Mauricio Faria de Oliveira wrote: Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Acked-by: Brian King <brk...@linux.vnet.ibm.com> --- v2: - use the scsi_cmd local variable rather than ipr_cmd->scsi_cmd dereference. - add Acked-by: Brian King. -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
On a dual controller setup with multipath enabled, some MEDIUM ERRORs caused both paths to be failed, thus I/O got queued/blocked since the 'queue_if_no_path' feature is enabled by default on IPR controllers. This example disabled 'queue_if_no_path' so the I/O failure is seen at the sg_dd program. Notice that after the sg_dd test-case, both paths are in 'failed' state, and both path/priority groups are in 'enabled' state (not 'active') -- which would block I/O with 'queue_if_no_path'. # sg_dd if=/dev/dm-2 bs=4096 count=1 dio=1 verbose=4 blk_sgio=0 <...> read(unix): count=4096, res=-1 sg_dd: reading, skip=0 : Input/output error <...> # dmesg [...] sd 2:2:16:0: [sds] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [...] sd 2:2:16:0: [sds] Sense Key : Medium Error [current] [...] sd 2:2:16:0: [sds] Add. Sense: Unrecovered read error - recommend rewrite the data [...] sd 2:2:16:0: [sds] CDB: Read(10) 28 00 00 00 00 00 00 00 20 00 [...] blk_update_request: I/O error, dev sds, sector 0 [...] device-mapper: multipath: Failing path 65:32. <...> [...] device-mapper: multipath: Failing path 65:224. # multipath -l 1IBM_IPR-0_59C2AE001F80 dm-2 IBM ,IPR-0 59C2AE00 size=5.2T features='0' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=0 status=enabled | `- 2:2:16:0 sds 65:32 failed undef running `-+- policy='service-time 0' prio=0 status=enabled `- 1:2:7:0 sdae 65:224 failed undef running This is not the desired behavior. The dm-multipath explicitly checks for the MEDIUM ERROR case (and a few others) so not to fail the path (e.g., I/O to other sectors could potentially happen without problems). See dm-mpath.c :: do_end_io_bio() -> noretry_error() !->! fail_path(). The problem trace is: 1) ipr_scsi_done() // SENSE KEY/CHECK CONDITION detected, go to.. 2) ipr_erp_start() // ipr_is_gscsi() and masked_ioasc OK, go to.. 3) ipr_gen_sense() // masked_ioasc is IPR_IOASC_MED_DO_NOT_REALLOC, // so set DID_PASSTHROUGH. 4) scsi_decide_disposition() // check for DID_PASSTHROUGH and return // early on, faking a DID_OK.. *instead* // of reaching scsi_check_sense(). // Had it reached the latter, that would // set host_byte to DID_MEDIUM_ERROR. 5) scsi_finish_command() 6) scsi_io_completion() 7) __scsi_error_from_host_byte() // That would be converted to -ENODATA <...> 8) dm_softirq_done() 9) multipath_end_io() 10) do_end_io() 11) noretry_error() // And that is checked in dm-mpath :: noretry_error() // which would cause fail_path() not to be called. With this patch applied, the I/O is failed but the paths are not. This multipath device continues accepting more I/O requests without blocking. (and notice the different host byte/driver byte handling per SCSI layer). # dmesg [...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 driverbyte=DRIVER_OK [...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 40 00 [...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current] [...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend rewrite the data [...] blk_update_request: critical medium error, dev sdaf, sector 0 [...] blk_update_request: critical medium error, dev dm-6, sector 0 [...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 driverbyte=DRIVER_OK [...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 10 00 [...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current] [...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend rewrite the data [...] blk_update_request: critical medium error, dev sdaf, sector 0 [...] blk_update_request: critical medium error, dev dm-6, sector 0 [...] Buffer I/O error on dev dm-6, logical block 0, async page read # multipath -l 1IBM_IPR-0_59C2AE001F80 1IBM_IPR-0_59C2AE001F80 dm-6 IBM ,IPR-0 59C2AE00 size=5.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=0 status=active | `- 2:2:7:0 sdaf 65:240 active undef running `-+- policy='service-time 0' prio=0 status=enabled `- 1:2:7:0 sdh 8:112 active undef running Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Acked-by: Brian King <brk...@linux.vnet.ibm.com> --- v2: - use the scsi_cmd local variable rather than ipr_cmd->scsi_cmd dereference. - add Acked-by: Brian King. drivers/scsi/ipr.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c index b29afafc2885..5d5e272fd815 100644 --- a/drivers/scsi/ipr.c +++ b/drivers/scsi/ipr.c @@ -6293,7 +6293,12 @@ static void ipr_erp_start(struct ipr_ioa_cfg *ioa_cfg, break;
Re: [PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
On 04/11/2017 11:34 AM, Brian King wrote: Looks good to me. Thanks for the detailed analysis. One minor nit below and you can add: Acked-by: Brian King <brk...@linux.vnet.ibm.com> Cool, thanks. I'll submit a v2. Since we already have a scsi_cmd local, you don't need to go back to the ipr_cmd to get the pointer, so could be: Thanks for catching that oversight. -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
On a dual controller setup with multipath enabled, some MEDIUM ERRORs caused both paths to be failed, thus I/O got queued/blocked since the 'queue_if_no_path' feature is enabled by default on IPR controllers. This example disabled 'queue_if_no_path' so the I/O failure is seen at the sg_dd program. Notice that after the sg_dd test-case, both paths are in 'failed' state, and both path/priority groups are in 'enabled' state (not 'active') -- which would block I/O with 'queue_if_no_path'. # sg_dd if=/dev/dm-2 bs=4096 count=1 dio=1 verbose=4 blk_sgio=0 <...> read(unix): count=4096, res=-1 sg_dd: reading, skip=0 : Input/output error <...> # dmesg [...] sd 2:2:16:0: [sds] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [...] sd 2:2:16:0: [sds] Sense Key : Medium Error [current] [...] sd 2:2:16:0: [sds] Add. Sense: Unrecovered read error - recommend rewrite the data [...] sd 2:2:16:0: [sds] CDB: Read(10) 28 00 00 00 00 00 00 00 20 00 [...] blk_update_request: I/O error, dev sds, sector 0 [...] device-mapper: multipath: Failing path 65:32. <...> [...] device-mapper: multipath: Failing path 65:224. # multipath -l 1IBM_IPR-0_59C2AE001F80 dm-2 IBM ,IPR-0 59C2AE00 size=5.2T features='0' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=0 status=enabled | `- 2:2:16:0 sds 65:32 failed undef running `-+- policy='service-time 0' prio=0 status=enabled `- 1:2:7:0 sdae 65:224 failed undef running This is not the desired behavior. The dm-multipath explicitly checks for the MEDIUM ERROR case (and a few others) so not to fail the path (e.g., I/O to other sectors could potentially happen without problems). See dm-mpath.c :: do_end_io_bio() -> noretry_error() !->! fail_path(). The problem trace is: 1) ipr_scsi_done() // SENSE KEY/CHECK CONDITION detected, go to.. 2) ipr_erp_start() // ipr_is_gscsi() and masked_ioasc OK, go to.. 3) ipr_gen_sense() // masked_ioasc is IPR_IOASC_MED_DO_NOT_REALLOC, // so set DID_PASSTHROUGH. 4) scsi_decide_disposition() // check for DID_PASSTHROUGH and return // early on, faking a DID_OK.. *instead* // of reaching scsi_check_sense(). // Had it reached the latter, that would // set host_byte to DID_MEDIUM_ERROR. 5) scsi_finish_command() 6) scsi_io_completion() 7) __scsi_error_from_host_byte() // That would be converted to -ENODATA <...> 8) dm_softirq_done() 9) multipath_end_io() 10) do_end_io() 11) noretry_error() // And that is checked in dm-mpath :: noretry_error() // which would cause fail_path() not to be called. With this patch applied, the I/O is failed but the paths are not. This multipath device continues accepting more I/O requests without blocking. (and notice the different host byte/driver byte handling per SCSI layer). # dmesg [...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 driverbyte=DRIVER_OK [...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 40 00 [...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current] [...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend rewrite the data [...] blk_update_request: critical medium error, dev sdaf, sector 0 [...] blk_update_request: critical medium error, dev dm-6, sector 0 [...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 driverbyte=DRIVER_OK [...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 10 00 [...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current] [...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend rewrite the data [...] blk_update_request: critical medium error, dev sdaf, sector 0 [...] blk_update_request: critical medium error, dev dm-6, sector 0 [...] Buffer I/O error on dev dm-6, logical block 0, async page read # multipath -l 1IBM_IPR-0_59C2AE001F80 1IBM_IPR-0_59C2AE001F80 dm-6 IBM ,IPR-0 59C2AE00 size=5.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=0 status=active | `- 2:2:7:0 sdaf 65:240 active undef running `-+- policy='service-time 0' prio=0 status=enabled `- 1:2:7:0 sdh 8:112 active undef running Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- P.S.: The switch-case checking this condition predates linux.git, so I'd like to ask Brian for a review too, as I imagine he probably might have worked with this back then (if memory may serve one this well :-) drivers/scsi/ipr.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c index b29afafc2885..1012674d9dc5 100644 --- a/drivers/scsi/ipr.c +++ b/drivers/scsi/ipr.c @@ -6293,7 +6293,12 @@ static void ipr_e
Re: [PATCH 0/4] scsi: scsi_dh_alua: handle target port unavailable state
On 04/10/2017 10:17 PM, Mauricio Faria de Oliveira wrote: For documentation purposes, I'll reply to this cover letter with the analysis of such cases of this problem, and the accompanying messages from kernel logs. Here it goes, for anyone interested. Scenario: 4 LUNs, 2 target port groups (PGs), 2 paths per PG. mpatha (360050764008100dac100) dm-0 IBM,2145 size=40G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=0 status=active | |- 1:0:1:0 sdf 8:80 active undef running | `- 2:0:1:0 sdn 8:208 active undef running `-+- policy='service-time 0' prio=0 status=enabled |- 1:0:0:0 sdb 8:16 active undef running `- 2:0:0:0 sdj 8:144 active undef running This LUN's active port group becomes Unavailable. The ongoing I/O requests sense it via SCSI command errors. Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:0: [sdf] tag#0 Add. Sense: Logical unit not accessible, target port in unavailable state Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:1:0: [sdn] tag#0 Add. Sense: Logical unit not accessible, target port in unavailable state Accordingly, dm-mpath fails both paths: Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:80. Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:208. The SCSI command error has gone through alua_check_sense(). It scheduled an alua_rtpg(), which confirms its port group (for 1:0:1:0) and the other (for 2:0:0:0) are unavailable. Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 state U non-preferred Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 state U non-preferred supports tolusna Failing over, dm-mpath switched path groups, to the other/enabled port group. Notice that further SCSI command errors _happen_, but are not _reported_, as alua_rtpg() has set pg->state = unavailable before more I/O requests eventually reached alua_prep_fn(), which sets them with RQF_QUIET. Accordingly, dm-mpath fails both paths: Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:144. Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:16. The path checkers are running, and constantly receiving senses of unavailable state (seen through debug messages), and alua_rtpg() rechecks are scheduled, but no repeated messages for 'state U' are printed, as it has not changed at all. After a few minutes, both port groups become active (optimized/non-optimized), and dm-mpath reinstate all paths: Apr 6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: Reinstating path 8:16. Apr 6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: Reinstating path 8:80. Apr 6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: Reinstating path 8:144. Apr 6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: Reinstating path 8:208. The dm-mpath reinstate calls alua_activate() -> alua_rtpg() for both paths in the active port group. The first alua_rtpg() worker is still running when the second alua_rtpg() is scheduled, so pg->rtpg_sdev does not change, and it's reported as 1:0:1:0 twice (but called for 1:0:1:0 and 2:0:1:0). Apr 6 07:21:01 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 state N non-preferred Apr 6 07:21:01 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 state A non-preferred supports tolusna Apr 6 07:21:01 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 state N non-preferred Apr 6 07:21:01 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 state A non-preferred supports tolusna ... mpathb (360050764008100dac101) dm-1 IBM,2145 size=40G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=0 status=active | |- 1:0:0:1 sdc 8:32 active undef running | `- 2:0:0:1 sdk 8:160 active undef running `-+- policy='service-time 0' prio=0 status=enabled |- 1:0:1:1 sdg 8:96 active undef running `- 2:0:1:1 sdo 8:224 active undef running Similarly, Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:1: [sdk] tag#2 Add. Sense: Logical unit not accessible, target port in unavailable state Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 1:0:0:1: [sdc] tag#5 Add. Sense: Logical unit not accessible, target port in unavailable state Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:160. Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:32. In this case alua_rtpg() does not print messages for the 'other' port group (without 'supports'), because the other PG worker was running at the time. There was an alua_rtpg() scheduled due to alua_check_sense() for 2:0:0:1, and another for alua_activate() of 1:0:1:1 (from the 'enabled' path group). Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:1: alua: port group 00 state U non-preferred supports tolusna
[PATCH 3/4] scsi: scsi_dh_alua: print changes to RTPG state of other PGs too
Currently, alua_rtpg() can change the 'state' and 'preferred' values for the current port group _and_ of other port groups present in the response buffer/descriptors. However, it reports such changes _only_ for the current port group (i.e., only for 'pg' but not for 'tmp_pg'). This might cause uncertainty and confusion when going through the kernel logs for analyzing/debugging scsi_dh_alua behavior, which is not helpful during support and development scenarios. So, print such changes for other port groups than the current one. This requires finding a scsi_device to call sdev_printk() with for that other port group. Using 'tmp_pg->rtpg_sdev' is not an option because in that 'if' conditional the 'tmp_pg' is not in the ALUA_PG_RUNNING state, so that pointer may be NULL. So the for-loop over the tmp->pg device_handler structures is used to find a valid scsi_device that is associated to this port group. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/device_handler/scsi_dh_alua.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index 0d3508f2e285..c2c9173fd883 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -688,6 +688,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) if ((tmp_pg == pg) || !(tmp_pg->flags & ALUA_PG_RUNNING)) { struct alua_dh_data *h; + struct scsi_device *tmp_pg_sdev = NULL; tmp_pg->state = desc[0] & 0x0f; tmp_pg->pref = desc[0] >> 7; @@ -697,6 +698,17 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) /* h->sdev should always be valid */ BUG_ON(!h->sdev); h->sdev->access_state = desc[0]; + + /* +* If tmp_pg is not the running pg, and its worker +* is not running, tmp_pg->rtpg_sdev might be NULL. +* Use another/one associated scsi_device to print +* its RTPG information. +*/ + if ((tmp_pg != pg) && !tmp_pg_sdev) { + tmp_pg_sdev = h->sdev; + alua_rtpg_print(tmp_pg_sdev, tmp_pg, NULL); + } } rcu_read_unlock(); } -- 1.8.3.1
[PATCH 2/4] scsi: scsi_dh_alua: create alua_rtpg_print() for alua_rtpg() sdev_printk
Factor out the sdev_printk() statement with the RTPG information in alua_rtpg() into a new function, alua_rtpg_print(), that will also be used in the following patch. The only functional difference is that the 'valid_states' value is now referenced via a pointer, and can be NULL (optional), in which case the 'valid_states' information is not printed. That is for the following patch too. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/device_handler/scsi_dh_alua.c | 43 ++ 1 file changed, 32 insertions(+), 11 deletions(-) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index 5e5a33cac951..0d3508f2e285 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -523,6 +523,37 @@ static int alua_tur(struct scsi_device *sdev) } /* + * alua_rtpg_print - Print REPORT TARGET GROUP STATES information + * @sdev: the device evaluated (or associated with this port group). + * @pg: the port group the information is associated with. + * @valid_states: pointer to valid_states value. + *(optional; e.g., obtained via another port group) + * + * Must be called with pg->lock held. + */ +static void alua_rtpg_print(struct scsi_device *sdev, struct alua_port_group *pg, + int *valid_states) +{ + if (valid_states) + sdev_printk(KERN_INFO, sdev, + "%s: port group %02x state %c %s supports %c%c%c%c%c%c%c\n", + ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state), + pg->pref ? "preferred" : "non-preferred", + (*valid_states) & TPGS_SUPPORT_TRANSITION ? 'T' : 't', + (*valid_states) & TPGS_SUPPORT_OFFLINE ? 'O' : 'o', + (*valid_states) & TPGS_SUPPORT_LBA_DEPENDENT ? 'L' : 'l', + (*valid_states) & TPGS_SUPPORT_UNAVAILABLE ? 'U' : 'u', + (*valid_states) & TPGS_SUPPORT_STANDBY ? 'S' : 's', + (*valid_states) & TPGS_SUPPORT_NONOPTIMIZED ? 'N' : 'n', + (*valid_states) & TPGS_SUPPORT_OPTIMIZED ? 'A' : 'a'); + else + sdev_printk(KERN_INFO, sdev, + "%s: port group %02x state %c %s\n", + ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state), + pg->pref ? "preferred" : "non-preferred"); +} + +/* * alua_rtpg - Evaluate REPORT TARGET GROUP STATES * @sdev: the device to be evaluated. * @@ -679,17 +710,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) } spin_lock_irqsave(>lock, flags); - sdev_printk(KERN_INFO, sdev, - "%s: port group %02x state %c %s supports %c%c%c%c%c%c%c\n", - ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state), - pg->pref ? "preferred" : "non-preferred", - valid_states_SUPPORT_TRANSITION?'T':'t', - valid_states_SUPPORT_OFFLINE?'O':'o', - valid_states_SUPPORT_LBA_DEPENDENT?'L':'l', - valid_states_SUPPORT_UNAVAILABLE?'U':'u', - valid_states_SUPPORT_STANDBY?'S':'s', - valid_states_SUPPORT_NONOPTIMIZED?'N':'n', - valid_states_SUPPORT_OPTIMIZED?'A':'a'); + alua_rtpg_print(sdev, pg, _states); switch (pg->state) { case SCSI_ACCESS_STATE_TRANSITIONING: -- 1.8.3.1
[PATCH 0/4] scsi: scsi_dh_alua: handle target port unavailable state
This patch series resolves a problem in which all paths of a multipath device became _permanently_ failed after a storage system had moved both controllers into a _temporarily_ unavailable state (that is SCSI_ACCESS_STATE_UNAVAILABLE). This happened because once scsi_dh_alua had set the 'pg->state' to that value, any IO coming to that PG via alua_prep_fn() would be immediately failed there. It was possible to confirm that IO coming to that PG by another function path (e.g., SG_IO) would perform normally once that PG's respective storage system controller had transitioned back to an active state. - Patch 1 essentially resolves that problem by allowing IO requests coming in the SCSI_ACCESS_STATE_UNAVAILABLE to actually proceed in alua_prep_fn(). It also schedules a recheck in alua_check_sense() to update pg->state properly. The problem/debug test-case is included in its commit message for reference. - Patch 2 and Patch 3 address uncertainty & potentially incorrect assumptions when trying to reconcile the alua: RTPG information in the kernel logs with the actual port groups state at a given point in time and to multipath/path checkers status/failed/reinstated messages, since scsi_dh_alua could update the PG state for the 'other' PG (i.e., not the PG by which the RTPG request was sent to) but only present an updated state message for the 'current' PG. - Patch 4 silences the scsi_dh_alua messages about RTPG state/information for the unavailable state if it is no news (i.e., not a transition to/out-of it), only keeping the first and (potentially) last message (when it is some news). That's because during the period in which the unavailable state is in place, the path checkers will naturally have to go through alua_check_sense() path, which schedules a recheck and thus alua_rtpg() goes through the sdev_printk. This patch series has been tested with the 4.11-rc4 kernel. For documentation purposes, I'll reply to this cover letter with the analysis of such cases of this problem, and the accompanying messages from kernel logs. Mauricio Faria de Oliveira (4): scsi: scsi_dh_alua: allow I/O in the target port unavailable state scsi: scsi_dh_alua: create alua_rtpg_print() for alua_rtpg() sdev_printk scsi: scsi_dh_alua: print changes to RTPG state of other PGs too scsi: scsi_dh_alua: do not print target port group state if it remains unavailable drivers/scsi/device_handler/scsi_dh_alua.c | 99 ++ 1 file changed, 88 insertions(+), 11 deletions(-) -- 1.8.3.1
[PATCH 1/4] scsi: scsi_dh_alua: allow I/O in the target port unavailable state
According to SPC-4 (5.15.2.4.5 Unavailable state), the unavailable state may (or may not) transition to other states (e.g., microcode downloading or hardware error, which may be temporary or permanent conditions, respectively). But, scsi_dh_alua currently fails the I/O requests early once that state is established (in alua_prep_fn()), which provides no chance for path checkers going through that function path to really check whether the path actually still fails I/O requests or recovered to an active state. This might cause device-mapper multipath to fail all paths to some storage system that moves the controllers to the unavailable state for firmware upgrades, and never recover regardless of the storage system doing upgrades one controller at a time and get them online. Then I/O requests are blocked indefinitely due to queue_if_no_path but the underlying individual paths are fully operational, and can be verified as such through other function paths (e.g., SG_IO): # multipath -l mpatha (360050764008100dac100) dm-0 IBM,2145 size=40G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=0 status=enabled | |- 1:0:1:0 sdf 8:80 failed undef running | `- 2:0:1:0 sdn 8:208 failed undef running `-+- policy='service-time 0' prio=0 status=enabled |- 1:0:0:0 sdb 8:16 failed undef running `- 2:0:0:0 sdj 8:144 failed undef running # strace -e read \ sg_dd if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \ 2>&1 | grep 512 read(3, 0x3fff7ba8, 512) = -1 EIO (Input/output error) # strace -e ioctl \ sg_dd if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \ blk_sgio=1 \ 2>&1 | grep 512 ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[10]=[28, 00, 00, 00, 00, 00, 00, 00, 01, 00], <...>) = 0 So, allow I/O to target port (groups) in the unavailable state, so the path checkers can actually check them, and schedule a recheck whenever the unavailable state is detected so pg->state can be updated properly (and further SCSI IO error messages then silenced through alua_prep_fn()). Once a path checker eventually detects an active state again, the port group state will be updated by the path activation call, alua_activate(), as it schedules an alua_rtpg() check. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Reported-by: Naresh Bannoth <nbann...@in.ibm.com> --- drivers/scsi/device_handler/scsi_dh_alua.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index c01b47e5b55a..5e5a33cac951 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -431,6 +431,20 @@ static int alua_check_sense(struct scsi_device *sdev, alua_check(sdev, false); return NEEDS_RETRY; } + if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0c) { + /* +* LUN Not Accessible - target port in unavailable state. +* +* It may (not) be possible to transition to other states; +* the transition might take a while or not happen at all, +* depending on the storage system model, error type, etc. +* +* Do not retry, so failover to another target port occur. +* Schedule a recheck to update state for other functions. +*/ + alua_check(sdev, true); + return SUCCESS; + } break; case UNIT_ATTENTION: if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00) { @@ -1057,6 +1071,8 @@ static void alua_check(struct scsi_device *sdev, bool force) * * Fail I/O to all paths not in state * active/optimized or active/non-optimized. + * Allow I/O to all paths in state unavailable + * so path checkers can actually check them. */ static int alua_prep_fn(struct scsi_device *sdev, struct request *req) { @@ -1072,6 +1088,8 @@ static int alua_prep_fn(struct scsi_device *sdev, struct request *req) rcu_read_unlock(); if (state == SCSI_ACCESS_STATE_TRANSITIONING) ret = BLKPREP_DEFER; + else if (state == SCSI_ACCESS_STATE_UNAVAILABLE) + req->rq_flags |= RQF_QUIET; else if (state != SCSI_ACCESS_STATE_OPTIMAL && state != SCSI_ACCESS_STATE_ACTIVE && state != SCSI_ACCESS_STATE_LBA) { -- 1.8.3.1
[PATCH 4/4] scsi: scsi_dh_alua: do not print target port group state if it remains unavailable
Path checkers will periodically check all paths to a target port group in unavailable state more often (as they are 'failed'), possibly for a long or indefinite period of time, or for a large number of paths. That might end up flooding the kernel log with the scsi_dh_alua target port group state messages, which are triggered by the rtpg recheck scheduled in alua_check_sense(). So, modify alua_rtpg() not to print such message when that unavailable state remains (i.e., the target port group state did not transition to or from the unavailable state). All other cases continue to be printed. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/device_handler/scsi_dh_alua.c | 26 -- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index c2c9173fd883..8025e024c011 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -554,6 +554,17 @@ static void alua_rtpg_print(struct scsi_device *sdev, struct alua_port_group *pg } /* + * alua_state_remains - Whether a RTPG state remains the same across 2 values. + * @state: the state value to check for. + * @old_state: the old state value. + * @new_state: the new state value. + */ +static bool alua_state_remains(int state, int old_state, int new_state) +{ + return ((old_state == state) && (new_state == state)); +} + +/* * alua_rtpg - Evaluate REPORT TARGET GROUP STATES * @sdev: the device to be evaluated. * @@ -571,6 +582,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) unsigned int tpg_desc_tbl_off; unsigned char orig_transition_tmo; unsigned long flags; + int orig_state; if (!pg->expiry) { unsigned long transition_tmo = ALUA_FAILOVER_TIMEOUT * HZ; @@ -656,6 +668,8 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) goto retry; } + orig_state = pg->state; + orig_transition_tmo = pg->transition_tmo; if ((buff[4] & RTPG_FMT_MASK) == RTPG_FMT_EXT_HDR && buff[5] != 0) pg->transition_tmo = buff[5]; @@ -689,6 +703,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) !(tmp_pg->flags & ALUA_PG_RUNNING)) { struct alua_dh_data *h; struct scsi_device *tmp_pg_sdev = NULL; + int tmp_pg_orig_state = tmp_pg->state; tmp_pg->state = desc[0] & 0x0f; tmp_pg->pref = desc[0] >> 7; @@ -704,8 +719,11 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) * is not running, tmp_pg->rtpg_sdev might be NULL. * Use another/one associated scsi_device to print * its RTPG information. +* Don't print it if state remains 'unavailable'. */ - if ((tmp_pg != pg) && !tmp_pg_sdev) { + if ((tmp_pg != pg) && !tmp_pg_sdev && + !alua_state_remains(SCSI_ACCESS_STATE_UNAVAILABLE, + tmp_pg_orig_state, tmp_pg->state)) { tmp_pg_sdev = h->sdev; alua_rtpg_print(tmp_pg_sdev, tmp_pg, NULL); } @@ -722,7 +740,11 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) } spin_lock_irqsave(>lock, flags); - alua_rtpg_print(sdev, pg, _states); + + /* Print RTPG information (except if state remains 'unavailable'). */ + if (likely(!alua_state_remains(SCSI_ACCESS_STATE_UNAVAILABLE, + orig_state, pg->state))) + alua_rtpg_print(sdev, pg, _states); switch (pg->state) { case SCSI_ACCESS_STATE_TRANSITIONING: -- 1.8.3.1
Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe
On 04/05/2017 01:23 PM, Song Liu wrote: Reviewed-by: Song Liu <songliubrav...@fb.com> Thanks for reviewing, Song Liu. It's good to know this patch doesn't break anything for you. cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH v2] scsi: ses: don't get power status of SES device slot on probe
The commit 08024885a2a3 ("ses: Add power_status to SES device slot") introduced the 'power_status' attribute to enclosure components and the associated callbacks. There are 2 callbacks available to get the power status of a device: 1) ses_get_power_status() for 'struct enclosure_component_callbacks' 2) get_component_power_status() for the sysfs device attribute (these are available for kernel-space and user-space, respectively.) However, despite both methods being available to get power status on demand, that commit also introduced a call to get power status in ses_enclosure_data_process(). This dramatically increased the total probe time for SCSI devices on larger configurations, because ses_enclosure_data_process() is called several times during the SCSI devices probe and loops over the component devices (but that is another problem, another patch). That results in a tremendous continuous hammering of SCSI Receive Diagnostics commands to the enclosure-services device, which does delay the total probe time for the SCSI devices __significantly__: Originally, ~34 minutes on a system attached to ~170 disks: [ 9214.490703] mpt3sas version 13.100.00.00 loaded ... [11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1) With this patch, it decreased to ~2.5 minutes -- a 13.6x faster [ 1002.992533] mpt3sas version 13.100.00.00 loaded ... [ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1) Back to the commit discussion.. on the ses_get_power_status() call introduced in ses_enclosure_data_process(): impact of removing it. That may possibly be in place to initialize the power status value on device probe. However, those 2 functions available to retrieve that value _do_ automatically refresh/update it. So the potential benefit would be a direct access of the 'power_status' field which does not use the callbacks... But the only reader of 'struct enclosure_component::power_status' is the get_component_power_status() callback for sysfs attribute, and it _does_ check for and call the .get_power_status callback, (which indeed is defined and implemented by that commit), so the power status value is, again, automatically updated. So, the remaining potential for a direct/non-callback access to the power_status attribute would be out-of-tree modules -- well, for those, if they are for whatever reason interested in values that are set during device probe and not up-to-date by the time they need it.. well, that would be curious. Well, to handle that more properly, set the initial power state value to '-1' (i.e., uninitialized) instead of '1' (power 'on'), and check for it in that callback which may do an direct access to the field value _if_ a callback function is not defined. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot") --- v2: - remove module parameter. - better return values for -1/uninitalized power_status. (thanks: Dan Williams <dan.j.willi...@intel.com>) drivers/misc/enclosure.c | 7 ++- drivers/scsi/ses.c | 1 - 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c index 65fed7146e9b..d3fe3ea902d4 100644 --- a/drivers/misc/enclosure.c +++ b/drivers/misc/enclosure.c @@ -148,7 +148,7 @@ struct enclosure_device * for (i = 0; i < components; i++) { edev->component[i].number = -1; edev->component[i].slot = -1; - edev->component[i].power_status = 1; + edev->component[i].power_status = -1; } mutex_lock(_list_lock); @@ -594,6 +594,11 @@ static ssize_t get_component_power_status(struct device *cdev, if (edev->cb->get_power_status) edev->cb->get_power_status(edev, ecomp); + + /* If still uninitialized, the callback failed or does not exist. */ + if (ecomp->power_status == -1) + return (edev->cb->get_power_status) ? -EIO : -ENOTTY; + return snprintf(buf, 40, "%s\n", ecomp->power_status ? "on" : "off"); } diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c index 50adabbb5808..f1cdf32d7514 100644 --- a/drivers/scsi/ses.c +++ b/drivers/scsi/ses.c @@ -548,7 +548,6 @@ static void ses_enclosure_data_process(struct enclosure_device *edev, ecomp = >component[components++]; if (!IS_ERR(ecomp)) { - ses_get_power_status(edev, ecomp); if (addl_desc_ptr) ses_process_descriptor( ecomp, -- 1.8.3.1
Re: [PATCH] scsi: ses: don't get power status of SES device slot on probe
On 04/05/2017 11:41 AM, Dan Williams wrote: On Wed, Apr 5, 2017 at 6:13 AM, Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> wrote: 1) imagine .get_power_status couldn't update the 'power_status' field (it's a bit unlikely with the in-tree ses driver, but in the case that ses_get_page2_descriptor() returns NULL, ses_get_power_status() doesn't update 'power_status'). Ok, in that case we should probably return -EIO, if get_power_status is non-NULL, but the power_status is still -1. 2) an out-of-tree module employs another enclosure_component_callbacks structure, which doesn't implement a .get_power_status hook. If get_power_status is null and power_status is -1 I think we should return -ENOTTY to indicate the failure. 3) or it does, but that failed, and didn't update 'power_status' (e.g., like case 1) I think we're back -EIO for this. Absolutely. I see those are better values to convey the failure/reason. However, for an in-tree usage, it's clear that just dropping that line seemed to be sufficient -- the rest in the patch is just consideration for whoever might be using it in out-of-tree ways. (and if such consideration is deemed not to be required in this case, I'd be fine with dropping the related changes and making this patch a one-line. :- ) Let's not carry module parameters that only theoretically help an out of tree module. If such a driver exists its owner can just holler if this policy change breaks their usage, and then we can have a discussion about why their driver isn't upstream already. Okay, got it. For the record, the patch author has been in To:/Cc:, for such cases. I'll submit a v2. Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [lpfc 05/19] Fix driver unload/reload operation.
Hi Dick, If you don't mind me being pedantic, I've noticed a couple of things. First, is your different email addresses in from/signed-off-by lines: On 04/04/2017 02:16 PM, Dick Kennedy wrote: From: Dick Kennedy <rkenn...@lvnvda1400.lvn.broadcom.net> [...] Signed-off-by: Dick Kennedy <dick.kenn...@broadcom.com> Second, is there are several indentation/line-break changes in this 'fix' patch too, of which at least one seems inconsistent in the code (see below) -- that would be better split off to another patch IMHO. Maybe that would be covered in the scrubbing process you mentioned in the other/regression thread. @@ -10312,6 +10315,7 @@ lpfc_pci_probe_one_s3(struct pci_dev *pdev, const struct pci_device_id *pid) } /* Initialize and populate the iocb list per host */ + error = lpfc_init_iocb_list(phba, LPFC_IOCB_LIST_CNT); if (error) { lpfc_printf_log(phba, KERN_ERR, LOG_INIT, @@ -11092,7 +11096,6 @@ lpfc_pci_probe_one_s4(struct pci_dev *pdev, const struct pci_device_id *pid) } /* Initialize and populate the iocb list per host */ - lpfc_printf_log(phba, KERN_INFO, LOG_INIT, "2821 initialize iocb list %d.\n", phba->cfg_iocb_cnt*1024); cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH] scsi: ses: don't get power status of SES device slot on probe
Hi Dan, Thanks for reviewing. On 04/04/2017 06:07 PM, Dan Williams wrote: @@ -594,6 +594,10 @@ static ssize_t get_component_power_status(struct device *cdev, if (edev->cb->get_power_status) edev->cb->get_power_status(edev, ecomp); + + if (ecomp->power_status == -1) + return -EINVAL; + Can we ever hit this if get_power_status is non-null? I admit that is cautionary, and more for consistency with the chance of an uninitialized state being still in place, so not to return any (incorrect) value in that state. But, yes, in a few cases -- although unlikely. 1) imagine .get_power_status couldn't update the 'power_status' field (it's a bit unlikely with the in-tree ses driver, but in the case that ses_get_page2_descriptor() returns NULL, ses_get_power_status() doesn't update 'power_status'). 2) an out-of-tree module employs another enclosure_component_callbacks structure, which doesn't implement a .get_power_status hook. 3) or it does, but that failed, and didn't update 'power_status' (e.g., like case 1) +static bool power_status_on_probe; /* default: false, 0. */ +module_param(power_status_on_probe, bool, 0644); +MODULE_PARM_DESC(power_status_on_probe, "get power status of SES device slots " + "(enclosure components) at probe time " + "(warning: may delay total probe time)"); + I don't think we need this module parameter as long as the only way to read the power status is through sysfs. We can always just leave it to be read on-demand when needed. I agree for the in-tree module. My only concern, as described in the commit message, is the case someone is using an out-of-tree module / has an implementation that depends on that (e.g., reading the 'power_ status' field directly, instead of using .get_power_status. I've been a bit overzealous when considering why that call was inserted in ses_enclosure_data_process(), and didn't want to break them. However, for an in-tree usage, it's clear that just dropping that line seemed to be sufficient -- the rest in the patch is just consideration for whoever might be using it in out-of-tree ways. (and if such consideration is deemed not to be required in this case, I'd be fine with dropping the related changes and making this patch a one-line. :- ) cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [REGRESSION] v4.11-rc3: lpfc: panic during module removal / shutdown
Hi Martin and Junichi, On 04/03/2017 11:10 PM, Junichi Nomura wrote: On 04/04/17 06:53, Mauricio Faria de Oliveira wrote: On 03/28/2017 11:29 PM, Junichi Nomura wrote: Since commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications"), "rmmod lpfc" starting to cause panic or corruption due to double free. Thanks for the report. Can you please check whether the patch just sent ([PATCH] lpfc: fix double free of bound CQ/WQ ring pointer) resolves it? It works for me. Thank you! Excellent, thanks! Martin, can you review/consider it for 4.11-rc6, please? Considering future maintenance, it might be a bit fragile to just depend on the code comment about representing the relation between cq/wq and shared pring but it's maintainers' choice. I agree -- there should be a better way of identifying a bound WQ/CQ. Perhaps there is, but I couldn't find it currently. For now, as far as I could grep and examine the code (detailed in commit message), a WQ is always bound to a CQ, so to check for WQ and not free its ring pointer seems to be sufficient (as the CQ ring pointer is freed first). If that changes, probably some form of flagging and/or queue type determination would be better/necessary. cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [REGRESSION] v4.11-rc3: lpfc: panic during module removal / shutdown
Hi Junichi, On 03/28/2017 11:29 PM, Junichi Nomura wrote: Since commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications"), "rmmod lpfc" starting to cause panic or corruption due to double free. Thanks for the report. Can you please check whether the patch just sent ([PATCH] lpfc: fix double free of bound CQ/WQ ring pointer) resolves it? I don't have a setup to test it handy right now. cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH] lpfc: fix double free of bound CQ/WQ ring pointer
commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications") binds the CQs and WQs ring pointer (sets it to same address on both). lpfc_create_wq_cq(): ... rc = lpfc_cq_create(phba, cq, eq, <...>) ... rc = lpfc_wq_create(phba, wq, cq, qtype); ... /* Bind this CQ/WQ to the NVME ring */ pring = wq->pring; ... cq->pring = pring; ... The commit frees both CQ & WQ for FCP/NVME on lpfc_sli4_queue_destroy(), which causes a double free (potential corruption or panic) on freeing the ring pointer of the second entity (CQ is first, WQ is second): lpfc_pci_remove_one() # that is, .remove / .shutdown -> lpfc_pci_remove_one_s4() -> lpfc_sli4_hba_unset() -> lpfc_sli4_queue_destroy() -> lpfc_sli4_release_queues() # Release FCP/NVME cqs -> __lpfc_sli4_release_queue() -> lpfc_sli4_queue_free() -> kfree(queue->pring) # first free -> lpfc_sli4_release_queues() # Release FCP/NVME wqs -> __lpfc_sli4_release_queue() -> lpfc_sli4_queue_free() -> kfree(queue->pring) # second free So, check for WQs in lpfc_sli4_queue_free() and do not free the pring, as it is freed before in the bound CQ. [the WQs are created only via lpfc_wq_create(), which sets struct lpfc_queue::type == LPFC_WQ. And that happens in 2 sites (lpfc_create_wq_cq() & lpfc_fof_queue_setup()), both of which bind the CQs & WQs. Thus, checking for the LPFC_WQ type correlates to whether the WQ is bound to a CQ, which is freed first.] Additional details: For reference, that binding also occurs on one other function: lpfc_fof_queue_setup(): ... rc = lpfc_cq_create(phba, phba->sli4_hba.oas_cq, <...>) ... rc = lpfc_wq_create(phba, phba->sli4_hba.oas_wq, <...>) ... /* Bind this CQ/WQ to the NVME ring */ pring = phba->sli4_hba.oas_wq->pring; ... phba->sli4_hba.oas_cq->pring = pring; And used to occur similarly on lpfc_sli4_queue_setup(), but was changed by that commit; although the problem is more related to the new freeing pattern introduced in lpfc_sli4_queue_destroy() plus the bound CQs/WQs. - /* Bind this WQ to the next FCP ring */ - pring = >ring[MAX_SLI3_CONFIGURED_RINGS + fcp_wqidx]; ... - phba->sli4_hba.fcp_cq[fcp_wqidx]->pring = pring; commit 85e8a23936ab ("scsi: lpfc: Add shutdown method for kexec") made this more likely as lpfc_pci_remove_one() is called on driver shutdown (e.g., modprobe -r / rmmod). (this patch is partially based on a different patch suggested by Johannes, thus adding a Suggested-by tag for due credit.) Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Reported-by: Junichi Nomura <j-nom...@ce.jp.nec.com> Suggested-by: Johannes Thumshirn <jthumsh...@suse.de> --- drivers/scsi/lpfc/lpfc_sli.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c index 1c9fa45df7eb..8befe841adaa 100644 --- a/drivers/scsi/lpfc/lpfc_sli.c +++ b/drivers/scsi/lpfc/lpfc_sli.c @@ -13758,7 +13758,14 @@ void lpfc_sli4_els_xri_abort_event_proc(struct lpfc_hba *phba) lpfc_free_rq_buffer(queue->phba, queue); kfree(queue->rqbp); } - kfree(queue->pring); + + /* +* The WQs/CQs' pring is bound (same pointer). +* So free it only once, and not again on WQ. +*/ + if (queue->type != LPFC_WQ) + kfree(queue->pring); + kfree(queue); return; } -- 1.8.3.1
[PATCH] scsi: ses: don't get power status of SES device slot on probe
The commit 08024885a2a3 ("ses: Add power_status to SES device slot") introduced the 'power_status' attribute to enclosure components and the associated callbacks. There are 2 callbacks available to get the power status of a device: 1) ses_get_power_status() for 'struct enclosure_component_callbacks' 2) get_component_power_status() for the sysfs device attribute (these are available for kernel-space and user-space, respectively.) However, despite both methods being available to get power status on demand, that commit also introduced a call to get power status in ses_enclosure_data_process(). This dramatically increased the total probe time for SCSI devices on larger configurations, because ses_enclosure_data_process() is called several times during the SCSI devices probe and loops over the component devices (but that is another problem, another patch). That results in a tremendous continuous hammering of SCSI Receive Diagnostics commands to the enclosure-services device, which does delay the total probe time for the SCSI devices __significantly__: Originally, ~34 minutes on a system attached to ~170 disks: [ 9214.490703] mpt3sas version 13.100.00.00 loaded ... [11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1) With this patch, it decreased to ~2.5 minutes -- a 13.6x faster [ 1002.992533] mpt3sas version 13.100.00.00 loaded ... [ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1) Another downside is that the SCSI device class lock is held during the executions of ses_enclosure_data_process() since it's a callee of ses_intf_add(), which is executed under that lock in device_add(). This ends up blocking the parallel SCSI scan functionality because device_add() depends on that lock, and also the removal of devices via device_del() (e.g., during an enclosure reboot). Back to the commit discussion.. on the ses_get_power_status() call introduced in ses_enclosure_data_process(): impact of removing it. That may possibly be in place to initialize the power status value on device probe. However, those 2 functions available to retrieve that value _do_ automatically refresh/update it. So the potential benefit would be a direct access of the 'power_status' field which does not use the callbacks... But the only reader of 'struct enclosure_component::power_status' is the get_component_power_status() callback for sysfs attribute, and it _does_ check for and call the .get_power_status callback, (which indeed is defined and implemented by that commit), so the power status value is, again, automatically updated. So, the remaining potential for a direct/non-callback access to the power_status attribute would be out-of-tree modules -- well, for those, if they are for whatever reason interested in values that are set during device probe and not up-to-date by the time they need it.. well, that would be curious. So, for out-of-tree modules and people still interested in this behavior (in case I might have missed something about it, other than the total probe time impact): let's add a module parameter to keep that check turned on (disabled by default). Also, to handle that more properly, set the initial power state value to '-1' (i.e., uninitialized) instead of '1' (power 'on'), and check for it in that callback which may do an direct access to the field value _if_ a callback function is not defined. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot") --- drivers/misc/enclosure.c | 6 +- drivers/scsi/ses.c | 9 - 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c index 65fed7146e9b..d3a8a13c4247 100644 --- a/drivers/misc/enclosure.c +++ b/drivers/misc/enclosure.c @@ -148,7 +148,7 @@ struct enclosure_device * for (i = 0; i < components; i++) { edev->component[i].number = -1; edev->component[i].slot = -1; - edev->component[i].power_status = 1; + edev->component[i].power_status = -1; } mutex_lock(_list_lock); @@ -594,6 +594,10 @@ static ssize_t get_component_power_status(struct device *cdev, if (edev->cb->get_power_status) edev->cb->get_power_status(edev, ecomp); + + if (ecomp->power_status == -1) + return -EINVAL; + return snprintf(buf, 40, "%s\n", ecomp->power_status ? "on" : "off"); } diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c index 50adabbb5808..2fafefd419c1 100644 --- a/drivers/scsi/ses.c +++ b/drivers/scsi/ses.c @@ -36,6 +36,12 @@ #include +static bool power_status_on_probe; /* default: false, 0. */ +module_param(power_status
Re: [PATCHv3 1/6] scsi_error: count medium access timeout only once per EH run
On 03/13/2017 11:48 AM, Hannes Reinecke wrote: This is assuming that we're always running on a scsi_disk, and that scsi_disk is the only one implementing 'eh_action'. Neither of which is necessarily true. Ah, OK. Thanks for explaining. -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCHv3 1/6] scsi_error: count medium access timeout only once per EH run
on access commands like * test unit ready (so wrongly see the device as having a successful - * recovery) + * recovery). + * We have to be careful to count a medium access failure only once + * per SCSI EH run; there might be several timed out commands which + * will cause the 'max_medium_access_timeouts' counter to trigger + * after the first SCSI EH run already and set the device to offline. **/ @@ -1714,7 +1724,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp) * process of recovering or has it suffered an internal failure * that prevents access to the storage medium. */ - sdkp->medium_access_timed_out++; + if (!sdkp->medium_access_reset) { + sdkp->medium_access_timed_out++; + sdkp->medium_access_reset = 1; + } /* * If the device keeps failing read/write commands but TEST UNIT diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 4dac35e..6a4f75a 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -106,6 +106,7 @@ struct scsi_disk { unsignedrc_basis: 2; unsignedzoned: 2; unsignedurswrz : 1; + unsignedmedium_access_reset : 1; }; #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev) -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v3 01/16] lpfc: Correct WQ creation for pagesize
Hello stable kernel maintainers, On 03/07/2017 12:35 AM, Martin K. Petersen wrote: Mauricio> Please flag this patch for stable. Mauricio> This patch resolves a serious problem on IBM Power systems at Mauricio> least. Both patches are already upstream so I can't tag them for stable. Either you or James should mail sta...@vger.kernel.org and request for the patches to be queued up. Can this commit be included on stable v4.4+ , please? (in Linus tree) 8ea73db486cda442f0671f4bc9c03a76be398a28 "scsi: lpfc: Correct WQ creation for pagesize" Thanks! For documentation purposes, the resolved issue is described on [1]. Following is the rationale for the stable kernel version asked for, in case Broadcom/others would like to discuss/contribute. These are the 3 commits which introduced the lines changed by the desired fix commit 0c651878ba3018bb4bbfa2ccd0a876bebb618768 "[SCSI] lpfc 8.3.41: Fixed support for 128 byte WQEs" 5a6f133eea2d0b4f8f75367b803fef0f03acf268 "[SCSI] lpfc 8.3.22: Add new mailbox command and new BSG fix" c31098cef5e091e22a02ff255f911e0ad71cc393 "[SCSI] lpfc 8.3.23: Fixes related to new hardware" So, checking the oldest release git tag which contains the most recent commit (so the fix applies cleanly) $ git tag --contains 0c651878ba3018bb4bbfa2ccd0a876bebb618768 | grep -v rc | sort -V v3.12 <...> v4.10 And it is cherry-picked/applies cleanly on v3.12: $ git checkout -b lpfc-v3.12 v3.12 Switched to a new branch 'lpfc-v3.12' $ git cherry-pick 8ea73db486cda442f0671f4bc9c03a76be398a28 [lpfc-v3.12 c2b0912e6135] scsi: lpfc: Correct WQ creation for pagesize Author: James Smart <jsmart2...@gmail.com> 2 files changed, 7 insertions(+), 4 deletions(-) However, v3.12 seems too old to go back, and it's unclear whether there was any problem actually reported on these older kernel releases, or whether the hardware/configuration that reproduces this problem is supported on v3.12 at all. On PowerPC (which is directly affected by the fix, as at least IBM Power servers uses a 64k page size by default on most distros), this fix is probably only required on v4.4+, so to cover a distro for which Broadcom doesn't ship inbox drivers to (w/ fix included). So, I'll ask it for stable v4.4+, and Broadcom/others can expand the range if they see fit/required. [1] http://www.spinics.net/lists/linux-scsi/msg105886.html cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH] scsi: lpfc: Add shutdown method for kexec
Martin, On 03/07/2017 02:24 AM, Benjamin Herrenschmidt wrote: On Mon, 2017-03-06 at 22:46 -0500, Martin K. Petersen wrote: I don't recall a consensus being reached on this patch. What would be the opposition ? Without it kexec breaks. With it, it works ... That is the argument I'd present/ask for consideration too. I think I should have included this in the tested-by tag email, for documentation/evidence: no regression observed in system shutdown path. Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v3 01/16] lpfc: Correct WQ creation for pagesize
Hi Martin, On 03/07/2017 12:35 AM, Martin K. Petersen wrote: Mauricio> Please flag this patch for stable. Both patches are already upstream so I can't tag them for stable. Either you or James should mail sta...@vger.kernel.org and request for the patches to be queued up. Right, sorry; I missed checking the right tree. Thanks for the pointers. -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH] scsi: lpfc: Add shutdown method for kexec
On 02/12/2017 07:49 PM, Anton Blanchard wrote: We see lpfc devices regularly fail during kexec. Fix this by adding a shutdown method which mirrors the remove method. Reviewed-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Tested-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> @mkp, @jejb: may you please flag this patch for stable? Thank you. This patch fixes a problem of some adapter ports going offline after reboot on the following adapter, at least: [10df:f100] Saturn-X: LightPulse Fibre Channel Host Adapter # dmesg | grep lpfc ... [ 61.676728] lpfc 0003:12:00.0: 6:0442 Adapter failed to init, mbxCmd x88 CONFIG_PORT, mbxStatus x0 Data: x0 [ 92.766717] lpfc 0003:12:00.0: 6:0442 Adapter failed to init, mbxCmd x88 CONFIG_PORT, mbxStatus x0 Data: x0 [ 123.856717] lpfc 0003:12:00.0: 6:0442 Adapter failed to init, mbxCmd x88 CONFIG_PORT, mbxStatus x0 Data: x0 ... -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v3 01/16] lpfc: Correct WQ creation for pagesize
Hi Martin and James, On 02/12/2017 07:52 PM, James Smart wrote: Correct WQ creation for pagesize Reviewed-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Please flag this patch for stable. This patch resolves a serious problem on IBM Power systems at least. An (apparently constant) series of invalid iotags cause a series of SCSI command aborts that escalates into host reset, which fails too, bringing the adapter offline. And it takes forever to boot, as udev waits on the initial SCSI I/O. Thank you. [ 17.044690] lpfc :01:00.1: 1:0372 iotag x0 is out off range: max iotag (x840) ... [ 723.680700] lpfc :01:00.1: 1:(0):0714 SCSI layer issued Bus Reset Data: x2003 [ 723.680873] lpfc :01:00.1: 1:(0):3172 SCSI layer issued Host Reset Data: [ 724.052506] lpfc :01:00.1: 1:0383 Error -5 during scsi sgl post operation [ 724.052701] lpfc :01:00.1: 1:(0):3323 Failed host reset, bring it offline -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH v2 10/11] lpfc: Add missing memory barrier
On 12/19/2016 09:07 PM, James Smart wrote: On loosely ordered memory systems (PPC for example), the WQE elements were being updated in memory, but not necessarily flushed before the separate doorbell was written to hw which would cause hw to dma the WQE element. Thus, the hardware occasionally received partially updated WQE data. Add the memory barrier after updating the WQE memory. Reviewed-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Martin, may you please flag this patch for stable? Thank you, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH 2/2] qla2xxx: Avoid that issuing a LIP triggers a kernel crash
Hi Martin, On 01/25/2017 09:29 PM, Martin K. Petersen wrote: Please do a proper patch submission for this fix. Okay, I submitted a v2 patch w/ the suggested change. However, the original patch has been submitted by Bart, so I believe credit is due, but not sure how to handle this case. Thus, please feel free to change the sign-off line as appropriate here. Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] qla2xxx: Avoid that issuing a LIP triggers a kernel crash
Avoid that issuing a LIP as follows: find /sys -name 'issue_lip'|while read f; do echo 1 > $f; done triggers the following: BUG: unable to handle kernel NULL pointer dereference at (null) Call Trace: qla2x00_abort_all_cmds+0xed/0x140 [qla2xxx] qla2x00_abort_isp_cleanup+0x1e3/0x280 [qla2xxx] qla2x00_abort_isp+0xef/0x690 [qla2xxx] qla2x00_do_dpc+0x36c/0x880 [qla2xxx] kthread+0x10c/0x140 Note: this patch is a slight change of the original patch sent by Bart, submitted by request of mkp. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Reported-by: Bart Van Assche <bart.vanass...@sandisk.com> Fixes: 1535aa75a3d8 ("qla2xxx: fix invalid DMA access after command aborts in PCI device remove") Cc: Himanshu Madhani <himanshu.madh...@cavium.com> Cc: <sta...@vger.kernel.org> --- drivers/scsi/qla2xxx/qla_os.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c index 0a000ecf0881..a17cb63b3fd5 100644 --- a/drivers/scsi/qla2xxx/qla_os.c +++ b/drivers/scsi/qla2xxx/qla_os.c @@ -1600,6 +1600,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) srb_t *sp; struct qla_hw_data *ha = vha->hw; struct req_que *req; + struct scsi_cmnd *scmd; qlt_host_reset_handler(ha); @@ -1613,10 +1614,12 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) { sp = req->outstanding_cmds[cnt]; if (sp) { + scmd = GET_CMD_SP(sp); + /* Don't abort commands in adapter during EEH * recovery as it's not accessible/responding. */ - if (!ha->flags.eeh_busy) { + if (scmd && !ha->flags.eeh_busy) { /* Get a reference to the sp and drop the lock. * The reference ensures this sp->done() call * - and not the call in qla2xxx_eh_abort() - @@ -1624,7 +1627,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) */ sp_get(sp); spin_unlock_irqrestore(>hardware_lock, flags); - qla2xxx_eh_abort(GET_CMD_SP(sp)); + qla2xxx_eh_abort(scmd); spin_lock_irqsave(>hardware_lock, flags); } req->outstanding_cmds[cnt] = NULL; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] qla2xxx: Avoid that issuing a LIP triggers a kernel crash
Hi Bart, First of all, sorry for the new bug; I didn't realize the pointer could be NULL at this scenario. On 01/23/2017 02:34 PM, Bart Van Assche wrote: @@ -1624,7 +1627,8 @@ qla2x00_abort_all_cmds(scsi_qla_host_t *vha, int res) */ sp_get(sp); spin_unlock_irqrestore(>hardware_lock, flags); - qla2xxx_eh_abort(GET_CMD_SP(sp)); + if (scmd) + qla2xxx_eh_abort(scmd); spin_lock_irqsave(>hardware_lock, flags); } Now, this chunk has a problem with reference counting (and unnecessary spin-locking), which we can avoid by simply moving up this NULL check. The call to sp_get() increments the sp->ref_count, but if you skip the call to qla2xxx_eh_abort() you don't get the decrement from the call to sp->done() at abort handling from ISR, e.g., qla24xx_abort_iocb_entry(). [or if the command completed successfully between issue/complete abort, at the completion from ISR, e.g., qla2x00_process_completed_request().] The sp->done() call just below this chunk was supposed to drop the initial reference [set at qla2xxx_queuecommand()] at a time we did not call qla2xxx_eh_abort() yet... but now that we __may__ call it (and get that sp->done() call from the ISR abort handling), we need to only increment it if we're going to drop it. That should be resolved with this slight change to your patch (which also helps w/ the spin-locking). What do you/others think? diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c index 0a000ecf0881..a17cb63b3fd5 100644 --- a/drivers/scsi/qla2xxx/qla_os.c +++ b/drivers/scsi/qla2xxx/qla_os.c @@ -1600,6 +1600,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) srb_t *sp; struct qla_hw_data *ha = vha->hw; struct req_que *req; + struct scsi_cmnd *scmd; qlt_host_reset_handler(ha); @@ -1613,10 +1614,12 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) { sp = req->outstanding_cmds[cnt]; if (sp) { + scmd = GET_CMD_SP(sp); + /* Don't abort commands in adapter during EEH * recovery as it's not accessible/responding. */ - if (!ha->flags.eeh_busy) { + if (scmd && !ha->flags.eeh_busy) { /* Get a reference to the sp and drop the lock. * The reference ensures this sp->done() call * - and not the call in qla2xxx_eh_abort() - @@ -1624,7 +1627,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) */ sp_get(sp); spin_unlock_irqrestore(>hardware_lock, flags); - qla2xxx_eh_abort(GET_CMD_SP(sp)); + qla2xxx_eh_abort(scmd); spin_lock_irqsave(>hardware_lock, flags); } req->outstanding_cmds[cnt] = NULL; Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi: do not requeue requests unaligned with device sector size
On 12/21/2016 05:50 AM, Christoph Hellwig wrote: > How do you even get an unaligned residual count? Except for SES > processor devices (which will only issue BLOCK_PC commands) this is > not allowed by SPC: > > "The residual count shall be reported in bytes if the peripheral device > type in the destination target descriptor is 03h (i.e., processor device), > and in destination device blocks for all other device type codes. On 12/21/2016 06:09 AM, Hannes Reinecke wrote: > Which actually would be pretty much my objection, too. > > This would only be applicable for 512e drives, where we _might_ end up > with a residual smaller than the physical sector size. > But that should be handled by firmware; after all, that's what the 'e' > implies, right? On 12/21/2016 12:01 PM, Martin K. Petersen wrote: I agree with Christoph and Hannes. Some of this falls into the gray area that's outside of the T10 spec (HBA programming interface guarantees) but it seems like a deficiency in the HBA to report a byte count that's not a multiple of the logical block size. A block can't be partially written. Either it made it or it didn't. Regardless of how the I/O is being broken up into frames at the transport level and at which offset the transfer was interrupted. Christoph, Hannes, Martin, Thank you all for your comments and pointers to the documentation/spec. I'll carry it on with the HBA and storage folks. cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] scsi: do not requeue requests unaligned with device sector size
if=/dev/sda of=/dev/null bs=4096 count=1 skip=16909064 iflag=direct 2>/dev/null; dmesg [...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272512, count=8 [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done. [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment (sector size 4096, remainder 3072, resid 1024) [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done. [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment (sector size 4096, remainder 3072, resid 1024) [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done. [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment (sector size 4096, remainder 3072, resid 1024) [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 4096 bytes done. [...] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done. Apologies for the ridiculously long commit message with description and test-cases, but this problem has been relatively difficult to reproduce and understand, so I thought the documentation/instructions for working with that could be helpful for others too. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/scsi_lib.c | 41 + 1 file changed, 41 insertions(+) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 2cca9cffc63f..190ab28cfb2d 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -821,6 +821,45 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes) } /* +* If the SCSI command is successful but has residual bytes, +* align good_bytes to the device sector size to ensure the +* requeued request (to complete the remainder transfer) is +* also aligned and does not fail alignment/size validation. +*/ + if (unlikely(!result && scsi_get_resid(cmd) && +req->cmd_type == REQ_TYPE_FS)) { + + unsigned int sector_size = cmd->device->sector_size; + unsigned int remainder = good_bytes % sector_size; + int resid = scsi_get_resid(cmd); + + SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, cmd, + "checking %d bytes for alignment " + "(sector size %u, remainder %u, resid %d)\n", + good_bytes, sector_size, remainder, resid)); + + if (remainder) { + good_bytes -= remainder; + resid += remainder; + scsi_set_resid(cmd, resid); + + /* +* If less than one device sector has completed, retry the +* request after delay (up to the retry limit) or timeout. +*/ + if (!good_bytes) { + if ((++cmd->retries) <= cmd->allowed + && !scsi_noretry_cmd(cmd)) { + goto delayed_retry; + } else { + set_host_byte(cmd, DID_TIME_OUT); + goto error; + } + } + } + } + + /* * special case: failed zero length commands always need to * drop down into the retry code. Otherwise, if we finished * all bytes in the request we are done now. @@ -845,6 +884,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes) if (result == 0) goto requeue; + error: error = __scsi_error_from_host_byte(cmd, result); if (host_byte(result) == DID_RESET) { @@ -985,6 +1025,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes) __scsi_queue_insert(cmd, SCSI_MLQUEUE_EH_RETRY, 0); break; case ACTION_DELAYED_RETRY: + delayed_retry: /* Retry the same command after a delay */ __scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY, 0); break; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND] block: allow WRITE_SAME commands with the SG_IO ioctl
The WRITE_SAME commands are not present in the blk_default_cmd_filter write_ok list, and thus are failed with -EPERM when the SG_IO ioctl() is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users). [ sg_io() -> blk_fill_sghdr_rq() > blk_verify_command() -> -EPERM ] The problem can be reproduced with the sg_write_same command # sg_write_same --num 1 --xferlen 512 /dev/sda # # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_same --num 1 --xferlen 512 /dev/sda' Write same: pass through os error: Operation not permitted # For comparison, the WRITE_VERIFY command does not observe this problem, since it is in that list: # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda' # So, this patch adds the WRITE_SAME commands to the list, in order for the SG_IO ioctl to finish successfully: # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_same --num 1 --xferlen 512 /dev/sda' # That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices (qemu "-device scsi-block" [1], libvirt "" [2]), which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu). In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls, which are translated to write-same calls in the guest kernel, and then into SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest: [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current] [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00 [...] blk_update_request: I/O error, dev sda, sector 17096824 Links: [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52 [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device') Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> Signed-off-by: Brahadambal Srinivasan <la...@linux.vnet.ibm.com> Reported-by: Manjunatha H R <manju...@in.ibm.com> --- block/scsi_ioctl.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 0774799..c6fee74 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -182,6 +182,9 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) __set_bit(WRITE_16, filter->write_ok); __set_bit(WRITE_LONG, filter->write_ok); __set_bit(WRITE_LONG_2, filter->write_ok); + __set_bit(WRITE_SAME, filter->write_ok); + __set_bit(WRITE_SAME_16, filter->write_ok); + __set_bit(WRITE_SAME_32, filter->write_ok); __set_bit(ERASE, filter->write_ok); __set_bit(GPCMD_MODE_SELECT_10, filter->write_ok); __set_bit(MODE_SELECT, filter->write_ok); -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
On 11/23/2016 12:12 PM, Johannes Thumshirn wrote: Looks good and sorry for the bug, Reviewed-by: Johannes Thumshirn <jthumsh...@suse.de> Thanks for the quick review. Not a problem! This problem turned out to be a good learning exercise. :) -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
Due credit; an oversight. On 11/23/2016 10:33 AM, Mauricio Faria de Oliveira wrote: Reported-by: Harsha Thyagaraja <hathy...@in.ibm.com> Cc: sta...@vger.kernel.org # v4.8 Fixes: 22466da5b4b7 ("lpfc: Fix possible NULL pointer dereference") Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
The BUG_ON() recently introduced in lpfc_sli_ringtxcmpl_put() is hit in the lpfc_els_abort() > lpfc_sli_issue_abort_iotag() > lpfc_sli_abort_iotag_issue() function path [similar names], due to 'piocb->vport == NULL': BUG_ON(!piocb || !piocb->vport); This happens because lpfc_sli_abort_iotag_issue() doesn't set the 'abtsiocbp->vport' pointer -- but this is not the problem. Previously, lpfc_sli_ringtxcmpl_put() accessed 'piocb->vport' only if 'piocb->iocb.ulpCommand' is neither CMD_ABORT_XRI_CN nor CMD_CLOSE_XRI_CN, which are the only possible values for lpfc_sli_abort_iotag_issue(): lpfc_sli_ringtxcmpl_put(): if ((unlikely(pring->ringno == LPFC_ELS_RING)) && (piocb->iocb.ulpCommand != CMD_ABORT_XRI_CN) && (piocb->iocb.ulpCommand != CMD_CLOSE_XRI_CN) && (!(piocb->vport->load_flag & FC_UNLOADING))) lpfc_sli_abort_iotag_issue(): if (phba->link_state >= LPFC_LINK_UP) iabt->ulpCommand = CMD_ABORT_XRI_CN; else iabt->ulpCommand = CMD_CLOSE_XRI_CN; So, this function path would not have hit this possible NULL pointer dereference before. In order to fix this regression, move the second part of the BUG_ON() check prior to the pointer dereference that it does check for. For reference, this is the stack trace observed. The problem happened because an unsolicited event was received - a PLOGI was received after our PLOGI was issued but not yet complete, so the discovery state machine goes on to sw-abort our PLOGI. kernel BUG at drivers/scsi/lpfc/lpfc_sli.c:1326! Oops: Exception in kernel mode, sig: 5 [#1] <...> NIP [...] lpfc_sli_ringtxcmpl_put+0x1c/0xf0 [lpfc] LR [...] __lpfc_sli_issue_iocb_s4+0x188/0x200 [lpfc] Call Trace: [...] [...] __lpfc_sli_issue_iocb_s4+0xb0/0x200 [lpfc] (unreliable) [...] [...] lpfc_sli_issue_abort_iotag+0x2b4/0x350 [lpfc] [...] [...] lpfc_els_abort+0x1a8/0x4a0 [lpfc] [...] [...] lpfc_rcv_plogi+0x6d4/0x700 [lpfc] [...] [...] lpfc_rcv_plogi_plogi_issue+0xd8/0x1d0 [lpfc] [...] [...] lpfc_disc_state_machine+0xc0/0x2b0 [lpfc] [...] [...] lpfc_els_unsol_buffer+0xcc0/0x26c0 [lpfc] [...] [...] lpfc_els_unsol_event+0xa8/0x220 [lpfc] [...] [...] lpfc_complete_unsol_iocb+0xb8/0x138 [lpfc] [...] [...] lpfc_sli4_handle_received_buffer+0x6a0/0xec0 [lpfc] [...] [...] lpfc_sli_handle_slow_ring_event_s4+0x1c4/0x240 [lpfc] [...] [...] lpfc_sli_handle_slow_ring_event+0x24/0x40 [lpfc] [...] [...] lpfc_do_work+0xd88/0x1970 [lpfc] [...] [...] kthread+0x108/0x130 [...] [...] ret_from_kernel_thread+0x5c/0xbc <...> Cc: sta...@vger.kernel.org # v4.8 Fixes: 22466da5b4b7 ("lpfc: Fix possible NULL pointer dereference") Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/lpfc/lpfc_sli.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c index c5326055beee..f4f77c5b0c83 100644 --- a/drivers/scsi/lpfc/lpfc_sli.c +++ b/drivers/scsi/lpfc/lpfc_sli.c @@ -1323,18 +1323,20 @@ struct lpfc_iocbq * { lockdep_assert_held(>hbalock); - BUG_ON(!piocb || !piocb->vport); + BUG_ON(!piocb); list_add_tail(>list, >txcmplq); piocb->iocb_flag |= LPFC_IO_ON_TXCMPLQ; if ((unlikely(pring->ringno == LPFC_ELS_RING)) && (piocb->iocb.ulpCommand != CMD_ABORT_XRI_CN) && - (piocb->iocb.ulpCommand != CMD_CLOSE_XRI_CN) && - (!(piocb->vport->load_flag & FC_UNLOADING))) - mod_timer(>vport->els_tmofunc, - jiffies + - msecs_to_jiffies(1000 * (phba->fc_ratov << 1))); + (piocb->iocb.ulpCommand != CMD_CLOSE_XRI_CN)) { + BUG_ON(!piocb->vport); + if (!(piocb->vport->load_flag & FC_UNLOADING)) + mod_timer(>vport->els_tmofunc, + jiffies + + msecs_to_jiffies(1000 * (phba->fc_ratov << 1))); + } return 0; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] qla2xxx: do not abort all commands in the adapter during EEH recovery
The previous commit ("qla2xxx: fix invalid DMA access after command aborts in PCI device remove") introduced a regression during an EEH recovery, since the change to the qla2x00_abort_all_cmds() function calls qla2xxx_eh_abort(), which verifies the EEH recovery condition but handles it heavy-handed. (commit a465537ad1a4 "qla2xxx: Disable the adapter and skip error recovery in case of register disconnect.") This problem warrants a more general/optimistic solution right into qla2xxx_eh_abort() (eg in case a real command abort arrives during EEH recovery, or if it takes long enough to trigger command aborts); but it's still worth to add a check to ensure the code added by the previous commit is correct and contained within its owner function. This commit just adds a 'if (!ha->flags.eeh_busy)' check around it. (ahem; a trivial fix for this -rc series; sorry for this oversight.) With it applied, both PCI device remove and EEH recovery works fine. Fixes: 1535aa75a3d8 ("scsi: qla2xxx: fix invalid DMA access after command aborts in PCI device remove") Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/qla2xxx/qla_os.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c index 567fa080e261..56d6142852a5 100644 --- a/drivers/scsi/qla2xxx/qla_os.c +++ b/drivers/scsi/qla2xxx/qla_os.c @@ -1456,15 +1456,20 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) { sp = req->outstanding_cmds[cnt]; if (sp) { - /* Get a reference to the sp and drop the lock. -* The reference ensures this sp->done() call -* - and not the call in qla2xxx_eh_abort() - -* ends the SCSI command (with result 'res'). + /* Don't abort commands in adapter during EEH +* recovery as it's not accessible/responding. */ - sp_get(sp); - spin_unlock_irqrestore(>hardware_lock, flags); - qla2xxx_eh_abort(GET_CMD_SP(sp)); - spin_lock_irqsave(>hardware_lock, flags); + if (!ha->flags.eeh_busy) { + /* Get a reference to the sp and drop the lock. +* The reference ensures this sp->done() call +* - and not the call in qla2xxx_eh_abort() - +* ends the SCSI command (with result 'res'). +*/ + sp_get(sp); + spin_unlock_irqrestore(>hardware_lock, flags); + qla2xxx_eh_abort(GET_CMD_SP(sp)); + spin_lock_irqsave(>hardware_lock, flags); + } req->outstanding_cmds[cnt] = NULL; sp->done(vha, sp, res); } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from lpfc_send_taskmgmt()
On 07/22/2016 05:18 PM, Martin K. Petersen wrote: Applied to 4.8/scsi-queue. Can this be routed into stable? The 'Fixes:' commit was introduced in v4.2, and there's longterm v4.4 and stables v4.5 and v4.6 after that. Given an oops (which may panic depending on configuration) got fixed, and it happens in normal scenarios (eg SCSI EH), it seems appropriate. Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] lpfc: support for CPU phys_id and core_id on PowerPC64
Hi Martin, On 06/21/2016 11:16 PM, Martin K. Petersen wrote: Distros are free to carry a patch such as yours. That puts the burden on them and not on upstream which is going in a different direction as outlined by Christoph. This is ultimately Broadcom's decision. It is their driver. Right, I understand. I submitted the patch in case they see value in having it upstream, as some distros discuss/ask about what's the status (or differences to) upstream. In some cases, a rationale like this one being documented on a mailing list is sufficient, provided the patch hasn't received other technical problems, for example. Thanks for the review/comments (Christoph too), -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] lpfc: support for CPU phys_id and core_id on PowerPC64
On 06/21/2016 11:29 AM, Christoph Hellwig wrote: I really don't think this sort of low level information has business in a low level driver. I've been trying to put some infrastructure together to move this to the core kernel [1], and it would be good to help use this in more drivers. Especially given that lpfc may use blk-mq which will not respect this current assignment for the queue mapping. Agree. However, this patchset targets the non blk-mq/scsi-mq usage of the lpfc driver, which falls back to CPU-number/queue-number mapping. This turns out to still be the case w/ some distros where ppc64/le usually runs, on which it would be easier to adapt this relatively small change than moving forward w/ blk-mq/scsi-mq, for example -- even if the latter is clearly a superior approach. [1] http://lists.infradead.org/pipermail/linux-nvme/2016-June/005012.html -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] lpfc: PowerPC64 topology information, and per-core scheduling
On 06/01/2016 05:43 PM, Mauricio Faria de Oliveira wrote: Tested on next-20160601 (with an extra commit for patch 1/2, see commit msg). FYI, that commit has been accepted into powerpc next [1]. [1] https://git.kernel.org/powerpc/c/f8ab481066e7246e4b272233aa -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: block: don't check request size in blk_cloned_rq_check_limits()
Hey, On 06/15/2016 01:34 PM, Brian King wrote: Mauricio was looking at this, adding him to cc. We did have a KVM config where we could reproduce this issue as well, I think with some PCI passthrough adapters. Mauricio - do you have any more details about the KVM config that reproduced this issue and did you ever try to reproduce this with an upstream kernel? It's KVM guest w/ SLES 12 SP1 + kernel updates (3.12.53-60.30 introduced the error) and PCI passthrough of Emulex FC and FCoE adapters, connected to 4 storage systems (DS8000, XIV, SVC, and FlashSystem 840). It seems only the XIV LUNs never hit the problem, but it didn't seem to be storage-specific, as FS840 LUNs had a mix of hit/not-hit the problem. One thing is that all paths of a LUN were either failed or not - no mix within a LUN. Unfortunately not too much analysis was performed on this system at the time -- Hannes had already made good progress w/ the customer, and some test kernel builds that resolved the issue were made available soon. Now that the topic is under discussion, I've asked for some time slots on that system, so we can test an upstream kernel and try to reproduce the problem and analyze it more closely. -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from lpfc_send_taskmgmt()
(x0) to 839909376 (x3210) lpfc: skip scsi_done() <...> lpfc 0006:01:00.4: 4:(0):0713 SCSI layer issued Device Reset (0, 0) return x2002 <...> lpfc 0006:01:00.4: 4:(0):0723 SCSI layer issued Target Reset (1, 0) return x2002 <...> lpfc 0006:01:00.4: 4:(0):0714 SCSI layer issued Bus Reset Data: x2002 <...> lpfc 0006:01:00.4: 4:(0):3172 SCSI layer issued Host Reset Data: <...> Fixes: 8b0dff14164d ("lpfc: Add support for using block multi-queue") Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/lpfc/lpfc_scsi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c index 3bd0be6..c7e5695 100644 --- a/drivers/scsi/lpfc/lpfc_scsi.c +++ b/drivers/scsi/lpfc/lpfc_scsi.c @@ -3874,7 +3874,7 @@ int lpfc_sli4_scmd_to_wqidx_distr(struct lpfc_hba *phba, uint32_t tag; uint16_t hwq; - if (shost_use_blk_mq(cmnd->device->host)) { + if (cmnd && shost_use_blk_mq(cmnd->device->host)) { tag = blk_mq_unique_tag(cmnd->request); hwq = blk_mq_unique_tag_to_hwq(tag); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] lpfc: support for CPU phys_id and core_id on PowerPC64
This commit uses the macros 'topology_physical_package_id' and 'topology_core_id' from , which are defined in arch/powerpc/include/asm/topology.h for CONFIG_PPC64 && CONFIG_SMP. Also, change the initial value for min_phys_id from 0xff to INT_MAX (the numbers may increment with large steps on some systems). While in there, include the CPU number in the debug message, which helps reading it on systems with many CPUs. This depends on commit 'powerpc: export cpu_to_core_id()' (submitted to the linuxppc-dev mailing list). Tested on next-20160601 w/ commit. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/lpfc/lpfc_init.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c index b43f7ac..03d1946 100644 --- a/drivers/scsi/lpfc/lpfc_init.c +++ b/drivers/scsi/lpfc/lpfc_init.c @@ -34,6 +34,8 @@ #include #include #include +#include +#include #include #include @@ -8718,7 +8720,7 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors) phba->sli4_hba.num_present_cpu)); max_phys_id = 0; - min_phys_id = 0xff; + min_phys_id = INT_MAX; phys_id = 0; num_io_channel = 0; first_cpu = LPFC_VECTOR_MAP_EMPTY; @@ -8730,6 +8732,9 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors) cpuinfo = _data(cpu); cpup->phys_id = cpuinfo->phys_proc_id; cpup->core_id = cpuinfo->cpu_core_id; +#elif defined(CONFIG_PPC64) && defined(CONFIG_SMP) + cpup->phys_id = topology_physical_package_id(cpu); + cpup->core_id = topology_core_id(cpu); #else /* No distinction between CPUs for other platforms */ cpup->phys_id = 0; @@ -8737,8 +8742,8 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors) #endif lpfc_printf_log(phba, KERN_INFO, LOG_INIT, - "3328 CPU physid %d coreid %d\n", - cpup->phys_id, cpup->core_id); + "3328 CPU %d physid %d coreid %d\n", + cpu, cpup->phys_id, cpup->core_id); if (cpup->phys_id > max_phys_id) max_phys_id = cpup->phys_id; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] lpfc: PowerPC64 topology information, and per-core scheduling
This series enables the association of IO channels at the CPU-core level (i.e., with all CPUs/threads of core), which may increase performance depending on the CPU affinity settings of the workload (e.g., benefit from the per-core caches); and adds the required pieces to use it on PowerPC64 too (topology information), which has server processors with many cores/threads and per-core caches. Although the series include bits for PowerPC64, the per-core scheduling patch is architecture independent. Tested on next-20160601 (with an extra commit for patch 1/2, see commit msg). Mauricio Faria de Oliveira (2): lpfc: support for CPU phys_id and core_id on PowerPC64 lpfc: add option for lpfc_fcp_io_sched (LPFC_FCP_SCHED_BY_CPU_CORE) drivers/scsi/lpfc/lpfc_attr.c | 8 -- drivers/scsi/lpfc/lpfc_hw4.h | 1 + drivers/scsi/lpfc/lpfc_init.c | 65 --- drivers/scsi/lpfc/lpfc_scsi.c | 3 +- 4 files changed, 70 insertions(+), 7 deletions(-) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] lpfc: add option for lpfc_fcp_io_sched (LPFC_FCP_SCHED_BY_CPU_CORE)
This commit introduces the new value 2 for parameter 'lpfc_fcp_io_sched', which associates work queues (or IO channels) by core, that is, all CPUs (or threads) of a particular core are assigned the same IO channel. The IO channels are assigned to each core in a round-robin fashion. Tested on next-20160601. Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com> --- drivers/scsi/lpfc/lpfc_attr.c | 8 +-- drivers/scsi/lpfc/lpfc_hw4.h | 1 + drivers/scsi/lpfc/lpfc_init.c | 54 ++- drivers/scsi/lpfc/lpfc_scsi.c | 3 ++- 4 files changed, 62 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c index cfec2ec..f8db2ad 100644 --- a/drivers/scsi/lpfc/lpfc_attr.c +++ b/drivers/scsi/lpfc/lpfc_attr.c @@ -4517,12 +4517,16 @@ LPFC_ATTR_R(ack0, 0, 0, 1, "Enable ACK0 support"); # For [0], FCP commands are issued to Work Queues ina round robin fashion. # For [1], FCP commands are issued to a Work Queue associated with the # current CPU. +# For [2], FCP commands are issued to a Work Queue associated with the +# current CPU core (same Work Queue for all CPUs of a core). # It would be set to 1 by the driver if it's able to set up cpu affinity # for FCP I/Os through Work Queue associated with the current CPU. Otherwise, # roundrobin scheduling of FCP I/Os through WQs will be used. +# It would remain set to 2 by the driver, likewise, if 2 was requested. */ -LPFC_ATTR_RW(fcp_io_sched, 0, 0, 1, "Determine scheduling algorithm for " - "issuing commands [0] - Round Robin, [1] - Current CPU"); +LPFC_ATTR_RW(fcp_io_sched, 0, 0, 2, "Determine scheduling algorithm for " + "issuing commands [0] - Round Robin, [1] - Current CPU, " + "[2] - Current CPU core"); /* # lpfc_fcp2_no_tgt_reset: Determine bus reset behavior diff --git a/drivers/scsi/lpfc/lpfc_hw4.h b/drivers/scsi/lpfc/lpfc_hw4.h index 0c7070b..3c63063 100644 --- a/drivers/scsi/lpfc/lpfc_hw4.h +++ b/drivers/scsi/lpfc/lpfc_hw4.h @@ -191,6 +191,7 @@ struct lpfc_sli_intf { /* Algrithmns for scheduling FCP commands to WQs */ #defineLPFC_FCP_SCHED_ROUND_ROBIN 0 #defineLPFC_FCP_SCHED_BY_CPU 1 +#defineLPFC_FCP_SCHED_BY_CPU_CORE 2 /* Delay Multiplier constant */ #define LPFC_DMULT_CONST 651042 diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c index 03d1946..917351c 100644 --- a/drivers/scsi/lpfc/lpfc_init.c +++ b/drivers/scsi/lpfc/lpfc_init.c @@ -8703,6 +8703,7 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors) { int i, idx, saved_chann, used_chann, cpu, phys_id; int max_phys_id, min_phys_id; + int max_core_id, min_core_id, core_id, prev_core_id; int num_io_channel, first_cpu, chan; struct lpfc_vector_map_info *cpup; #ifdef CONFIG_X86 @@ -8722,6 +8723,9 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors) max_phys_id = 0; min_phys_id = INT_MAX; phys_id = 0; + max_core_id = 0; + min_core_id = INT_MAX; + core_id = 0; num_io_channel = 0; first_cpu = LPFC_VECTOR_MAP_EMPTY; @@ -8749,6 +8753,12 @@ lpfc_sli4_set_affinity(struct lpfc_hba *phba, int vectors) max_phys_id = cpup->phys_id; if (cpup->phys_id < min_phys_id) min_phys_id = cpup->phys_id; + + if (cpup->core_id > max_core_id) + max_core_id = cpup->core_id; + if (cpup->core_id < min_core_id) + min_core_id = cpup->core_id; + cpup++; } @@ -8820,6 +8830,46 @@ found: } /* +* With lpfc_fcp_io_sched per core, associate IO channels with CPUs +* based only on the core numbers instead of individual CPU numbers, +* and ignore/overwrite already assigned values (from MSI-x vectors). +*/ + if (phba->cfg_fcp_io_sched == LPFC_FCP_SCHED_BY_CPU_CORE) { + + /* The IO channel used by a core */ + chan = -1; + + /* For each core, assign its (sequential) CPUs the same IO channel */ + prev_core_id = -1; + for (core_id = min_core_id; core_id <= max_core_id; core_id++) { + + cpup = phba->sli4_hba.cpu_map; + for (cpu = 0; cpu < phba->sli4_hba.num_present_cpu; cpu++, cpup++) { + + if (cpup->core_id != core_id) + continue; + + /* Round-robin on different cores */ + if (core_id != prev_core_id) { + prev_core_id = core_id; + chan = (chan + 1)
Re: [dm-devel] IBM request to allow unprivledged ioctls [Was: Revert "dm mpath: fix stalls when handling invalid ioctls"]
Hi Mike, On 10/29/2015 11:18 AM, Mike Snitzer wrote: Sorry but I fail to see why your request to revert is viable. No problem. Thanks for moving this for a discussion on a proper fix. I'm somewhat new to kernel and SCSI workings and its community process, so that's certainly appreciated. Wouldn't the correct fix be to train scsi_verify_blk_ioctl() to work even without SYS_CAP_RAWIO? I see it would fix the problem as well, but I don't happen to know all of its usages, so I'd have to defer to question to someone who knows more of it. I'm not doubting that the commit caused problems for the case you care about (unprivledged users issueing ioctls) but that is an entirely different issue that needs to be discussed more directly (with the broader linux-scsi community, now cc'd). Sure. Thanks for opening the discussion. p.s. leaving full context for linux-scsi so they don't need to track down other mails (btw, thanks for the detailed patch header but it enabled me to be skeptical of your request to revert): You're welcome. If it's been useful for rejecting this patch and getting a better one later, it's worth it. :) Kind regards, -- Mauricio Faria de Oliveira IBM Linux Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [RESEND] qla2xxx: prevent board_disable from running during EEH
Commit f3ddac1918fe963bcbf8d407a3a3c0881b47248b ([SCSI] qla2xxx: Disable adapter when we encounter a PCI disconnect.) has introduced a code that disables the board, releasing some resources, when reading 0x. In case this happens when there is an EEH, this read will trigger EEH detection and set PCI channel offline. EEH will be able to recover the card from this state by doing a reset, so it's a better option than simply disabling the card. Since eeh_check_failure will mark the channel as offline before returning the read value, in case there really was an EEH, we can simply check for pci_channel_offline, preventing the board_disable code from running if it's true. Without this patch, EEH code will try to access those same resources that board_disable will try to free. This race can cause EEH recovery to fail. [ 504.370577] EEH: Notify device driver to resume [ 504.370580] qla2xxx [0001:07:00.0]-9002:2: The device failed to resume I/O from slot/link_reset. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Cc: casca...@debian.org Cc: mauri...@linux.vnet.ibm.com --- drivers/scsi/qla2xxx/qla_isr.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c index a04a1b1..8132926 100644 --- a/drivers/scsi/qla2xxx/qla_isr.c +++ b/drivers/scsi/qla2xxx/qla_isr.c @@ -116,7 +116,7 @@ bool qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg) { /* Check for PCI disconnection */ - if (reg == 0x) { + if (reg == 0x !pci_channel_offline(vha-hw-pdev)) { if (!test_and_set_bit(PFLG_DISCONNECTED, vha-pci_flags) !test_bit(PFLG_DRIVER_REMOVING, vha-pci_flags) !test_bit(PFLG_DRIVER_PROBING, vha-pci_flags)) { -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html