Re: SCSI error handling -- one error blocks the whole SCSI host
On 5/27/2013 8:32 PM, Baruch Even wrote:

> necessary but the command itself if it is already actively handled continues in its path. The abort only cancels those commands that are in the queue and if there really was a problem and the disk is engaging in error recovery of its own you'll just have no response from it and it will seem dead (abort may timeout).

Yes, the abort seems to be handled more like a hint in many cases. Having coded a couple of targets, I can say abort handling is often _REALLY_ hard to get 100% right, especially when it's an actual error that is causing the delay rather than a correctly functioning long-running command.

That said, I've seen devices actually respond to aborts on tape ERASE and similar commands by aborting the command as one would expect. So it does sometimes work.

Besides abort timeouts (which are major bad karma), the abort may be accepted, and the next non-INQUIRY/TUR type command that gets queued simply blocks waiting for the abort to complete internally.

From the target device perspective, if you don't send a response for ABTS within 2*RA_TOV then your problems start to multiply. So it encourages target devices to treat aborts in an async manner. As you said, the device simply finds the indicated command on a queue, marks it as being aborted and hopes whatever is processing the command notices and terminates its operation. On subsequent commands the nicer devices will notice the abort hasn't completed and return "becoming ready" or similar in response to TUR etc. for some number of minutes.

> This view of aborts also means that reducing timeouts for commands and TMFs is mostly useless and sometimes even a really bad idea. I prefer to just let the device go on with its error recovery and just forget about the command. I want to forget about the DMA so I issue an abort but anything higher than that means a link is dead to me.

Well, invariably the manufacturers have timeouts that are really long and based on internal error recovery logic. See http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003556aid=1 page 468. Notice the timeouts are specified in minutes, not seconds. Furthermore, the commands that normally complete in fractions of a second have actual timeouts that can be tens of minutes (READ/WRITE for example). So doing anything before that timeout has expired is a good way to knock the device offline.

Some of the newer disks have mode page options to shorten their read/write error recovery, but short error recovery can still be many tens of seconds rather than a couple of minutes. Plus, it doesn't help compound commands like SYNCHRONIZE CACHE, which may hit multiple errors during operation.

This is another part of what formed my opinions about error isolation. If one of your devices goes out to lunch and isn't recovering via abort/LUN reset, it's done! Wrecking the rest of the SAN doing bus resets and HBA resets is a good way to take a serious problem and turn it into a full-blown catastrophe.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
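Editorial sketch: the isolation policy argued for in the message above (try an abort, then a LUN reset, then give up on the one device rather than escalating to bus or HBA resets) can be modelled as a capped escalation ladder. This is a minimal illustration and not kernel code; `enum eh_step`, `next_step()` and the `max_allowed` parameter are hypothetical names invented for the example.

```c
#include <assert.h>

/* Hypothetical recovery steps, ordered from least to most disruptive.
 * EH_OFFLINE means "stop recovering and take only this device offline". */
enum eh_step {
    EH_ABORT,
    EH_LUN_RESET,
    EH_BUS_RESET,
    EH_HOST_RESET,
    EH_OFFLINE
};

/* Return the step to try after `cur` has failed.
 * `max_allowed` is the most disruptive step the policy permits; once it
 * has been tried, the device is given up instead of escalating further. */
enum eh_step next_step(enum eh_step cur, enum eh_step max_allowed)
{
    assert(cur != EH_OFFLINE);  /* nothing follows offlining */

    if (cur >= max_allowed || cur >= EH_HOST_RESET)
        return EH_OFFLINE;

    return (enum eh_step)(cur + 1);
}
```

With `max_allowed` set to `EH_LUN_RESET`, a failed LUN reset leads straight to `EH_OFFLINE`, so the other devices behind the same HBA are never touched by a bus or host reset.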
Re: SCSI error handling -- one error blocks the whole SCSI host
On Tue, May 28, 2013 at 5:38 PM, Jeremy Linton jlin...@tributary.com wrote:

> This is another part of what formed my opinions about error isolation. If one of your devices goes out to lunch and isn't recovering via abort/LUN reset, it's done! Wrecking the rest of the SAN doing bus resets and HBA resets is a good way to take a serious problem and turn it into a full-blown catastrophe.

This is the gist of the issue: once you get to an abort you are screwed already. You need the abort, but anything else should be reserved for when things are really dead (the HBA might still recover on a host reset, but only do it if the host is really unresponsive).

That's why I prefer to have a long timeout for the command and a long timeout for the abort. The application above should handle itself with its own timeout once the abort has been sent (the buffer remains locked until the abort returns). The device itself is likely stuck in error recovery, and it will come out of it when its own internal timeouts are exhausted, which can be infinite and will generally be very large.

Baruch
Re: SCSI error handling -- one error blocks the whole SCSI host
On 05/27/2013 12:44 AM, James Bottomley wrote:
> On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
> > At LSF this year, we had a discussion about error handling and in particular the problem that SCSI midlayer error handling waits for the entire SCSI host (HBA) to quiesce before it starts to abort commands etc. James made the suggestion that FC should handle things the way SAS does, because SAS has a strategy handler that does things the right way.
> >
> > However, now that I finally sit down and look at the code, I don't see how this is the case. It seems inherent in the way that scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in particular the strategy handler can't even be called until host_failed == host_busy; we don't bump host_failed without SHOST_RECOVERY set, which stops queueing commands to any devices attached to the whole HBA).
> >
> > James, am I understanding your suggestion properly? If so can you explain what you meant about the libsas code -- I see that it has its own strategy handler but as I said before we've already stopped every device attached to the HBA before we ever get there.
>
> It is, but I checked: Apparently it's not implemented in the sas transport class.
>
> The original discussion when libsas was constructed, as I remember it, was about using the scsi timeout handler to implement a running abort. The idea is fairly simple: you use the first fire of eh_timed_out to trigger the abort (or LUN reset) while simultaneously returning BLK_EH_RESET_TIMER. If the timer fires again and the abort hasn't returned, you escalate, otherwise you resend the command when the abort returns. This allows you to handle single command failures (up to LUN reset) without stopping the host. Obviously, if you have to escalate to device reset, then you need to start the eh thread.

There are some problems with that:

- Returning BLK_EH_RESET_TIMER will restart the timer with the _default_ blk timeout, whereas the _abort_ timeout might be (and, for some LLDDs, definitely is) different from that.
- Leaving the command running while abort is active will inevitably risk a double completion on the original command; the command abort might terminate the command at the same time as the (real) completion comes in. 'Normal' command timeouts are protected against this via REQ_ATOM_COMPLETE; commands aborted via scsi_finish_cmnd() are not.
- LLDDs typically won't return a command status even for a command which has been aborted via ABORT TASK TMF. So the midlayer probably will never get notified if the command got aborted via ABORT TASK.

Especially the last point made me abandon this idea for my EH rewrite. We would get a real benefit if we could somehow get the command status _from the target_ for an aborted command. But as it appears, we won't. So as any status is made up anyway, I'd very much prefer to have it set by the midlayer. Which renders the whole operation quite pointless, and we're better off using the existing syntax for command aborts. Plus it makes life _so much_ easier for the implementation ...

But to answer Roland: Have you checked my patchset? It should help for command timeouts ...

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries Storage
h...@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
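Editorial sketch: the double-completion concern in the second bullet above comes down to two paths (the real completion and the abort) racing to finish the same command. The usual guard is an atomic test-and-set where only the winner completes, which is the role the message attributes to REQ_ATOM_COMPLETE. The following is a minimal userspace model using C11 atomics, not the block layer's implementation; `struct req_model`, `req_init()` and `req_try_complete()` are hypothetical names, and the result values are arbitrary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of a request guarded against double completion. */
struct req_model {
    atomic_flag completed;  /* plays the role of a "complete" bit */
    int result;             /* status recorded by whichever path won */
    int completions;        /* how many times completion actually ran */
};

void req_init(struct req_model *r)
{
    assert(r);
    atomic_flag_clear(&r->completed);
    r->result = 0;
    r->completions = 0;
}

/* Try to complete the request with `result`.
 * Returns true if this caller won the race and completed it, false if
 * another path (normal completion or abort) had already done so. */
bool req_try_complete(struct req_model *r, int result)
{
    assert(r);
    if (atomic_flag_test_and_set(&r->completed))
        return false;       /* lost the race: must not touch the request */

    r->result = result;
    r->completions++;
    return true;
}
```

Whichever of the abort path and the hardware completion path calls `req_try_complete()` first records its status; the loser sees `false` and backs off, so `completions` never exceeds one.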
Re: SCSI error handling -- one error blocks the whole SCSI host
On Mon, 2013-05-27 at 16:39 +0200, Hannes Reinecke wrote:
> On 05/27/2013 12:44 AM, James Bottomley wrote:
> > On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
> > > At LSF this year, we had a discussion about error handling and in particular the problem that SCSI midlayer error handling waits for the entire SCSI host (HBA) to quiesce before it starts to abort commands etc. James made the suggestion that FC should handle things the way SAS does, because SAS has a strategy handler that does things the right way.
> > >
> > > However, now that I finally sit down and look at the code, I don't see how this is the case. It seems inherent in the way that scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in particular the strategy handler can't even be called until host_failed == host_busy; we don't bump host_failed without SHOST_RECOVERY set, which stops queueing commands to any devices attached to the whole HBA).
> > >
> > > James, am I understanding your suggestion properly? If so can you explain what you meant about the libsas code -- I see that it has its own strategy handler but as I said before we've already stopped every device attached to the HBA before we ever get there.
> >
> > It is, but I checked: Apparently it's not implemented in the sas transport class.
> >
> > The original discussion when libsas was constructed, as I remember it, was about using the scsi timeout handler to implement a running abort. The idea is fairly simple: you use the first fire of eh_timed_out to trigger the abort (or LUN reset) while simultaneously returning BLK_EH_RESET_TIMER. If the timer fires again and the abort hasn't returned, you escalate, otherwise you resend the command when the abort returns. This allows you to handle single command failures (up to LUN reset) without stopping the host. Obviously, if you have to escalate to device reset, then you need to start the eh thread.
>
> There are some problems with that:
>
> - Returning BLK_EH_RESET_TIMER will restart the timer with the _default_ blk timeout, whereas the _abort_ timeout might be (and, for some LLDDs, definitely is) different from that.

Right ... you don't reuse the command, you have to start a new one. libsas actually has a task abstraction, which is what you use to send TMFs.

> - Leaving the command running while abort is active will inevitably risk a double completion on the original command; the command abort might terminate the command at the same time as the (real) completion comes in. 'Normal' command timeouts are protected against this via REQ_ATOM_COMPLETE; commands aborted via scsi_finish_cmnd() are not.

That's not a bug, it's a requirement. The way you handle commands in a running abort or LUN reset is only in the status return code from the command, so you have to tie the success of the eh action to the base command and return DID_ABORT (or DID_RESET) in the actual command ... this is how retries get done without troubling the error handler. Essentially, this requires a low level tie with the HBA machine description of the command, which is what avoids double completion.

> - LLDDs typically won't return a command status even for a command which has been aborted via ABORT TASK TMF. So the midlayer probably will never get notified if the command got aborted via ABORT TASK.

Well, that's true, but irrelevant. If the HBA can't inform you of the status of the abort, then abort is useless as a first step in the traditional eh as well as in this method, so you just don't do that and proceed to resets.

There's actually a school of thought that says even if the HBA *can* give you all the status you need, aborts are still pointless because it's sending in yet another state transition to an already failed state machine (because the device is timing out). Therefore, since the chance of recovering the state machine with an abort is so tiny, you should start with the lowest reset anyway because that takes the state machine to a known state.

James

> Especially the last point made me abandon this idea for my EH rewrite. We would get a real benefit if we could somehow get the command status _from the target_ for an aborted command. But as it appears, we won't. So as any status is made up anyway, I'd very much prefer to have it set by the midlayer. Which renders the whole operation quite pointless, and we're better off using the existing syntax for command aborts. Plus it makes life _so much_ easier for the implementation ...
>
> But to answer Roland: Have you checked my patchset? It should help for command timeouts ...
>
> Cheers,
>
> Hannes
Re: SCSI error handling -- one error blocks the whole SCSI host
On Mon, May 27, 2013 at 11:41 PM, James Bottomley james.bottom...@hansenpartnership.com wrote:
> On Mon, 2013-05-27 at 16:39 +0200, Hannes Reinecke wrote:
> > - LLDDs typically won't return a command status even for a command which has been aborted via ABORT TASK TMF. So the midlayer probably will never get notified if the command got aborted via ABORT TASK.
>
> Well, that's true, but irrelevant. If the HBA can't inform you of the status of the abort, then abort is useless as a first step in the traditional eh as well as in this method, so you just don't do that and proceed to resets.
>
> There's actually a school of thought that says even if the HBA *can* give you all the status you need, aborts are still pointless because it's sending in yet another state transition to an already failed state machine (because the device is timing out). Therefore, since the chance of recovering the state machine with an abort is so tiny, you should start with the lowest reset anyway because that takes the state machine to a known state.

Most devices I know do not really abort the command in any normal sense anyhow. Not even when doing a reset. The disks (HDD, SSD) and also SAN systems normally just treat an abort or a reset as a signal that no real reply is necessary, but the command itself, if it is already actively being handled, continues on its path. The abort only cancels those commands that are still in the queue, and if there really was a problem and the disk is engaged in error recovery of its own, you'll just get no response from it and it will seem dead (the abort may time out).

The one thing aborts/resets help with is clearing your HBA of any pending commands, so that your DMA buffers will no longer be affected and you can forget the command and do your application-level recovery (RAID, or lose data and panic). It is also an important part of handling bad links, but at least in SAS that is done internally in the HBA anyway.

This view of aborts also means that reducing timeouts for commands and TMFs is mostly useless and sometimes even a really bad idea. I prefer to just let the device go on with its error recovery and just forget about the command. I want to forget about the DMA so I issue an abort, but anything higher than that means a link is dead to me.

Baruch
Re: SCSI error handling -- one error blocks the whole SCSI host
On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
> At LSF this year, we had a discussion about error handling and in particular the problem that SCSI midlayer error handling waits for the entire SCSI host (HBA) to quiesce before it starts to abort commands etc. James made the suggestion that FC should handle things the way SAS does, because SAS has a strategy handler that does things the right way.
>
> However, now that I finally sit down and look at the code, I don't see how this is the case. It seems inherent in the way that scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in particular the strategy handler can't even be called until host_failed == host_busy; we don't bump host_failed without SHOST_RECOVERY set, which stops queueing commands to any devices attached to the whole HBA).
>
> James, am I understanding your suggestion properly? If so can you explain what you meant about the libsas code -- I see that it has its own strategy handler but as I said before we've already stopped every device attached to the HBA before we ever get there.

It is, but I checked: Apparently it's not implemented in the sas transport class.

The original discussion when libsas was constructed, as I remember it, was about using the scsi timeout handler to implement a running abort. The idea is fairly simple: you use the first fire of eh_timed_out to trigger the abort (or LUN reset) while simultaneously returning BLK_EH_RESET_TIMER. If the timer fires again and the abort hasn't returned, you escalate, otherwise you resend the command when the abort returns. This allows you to handle single command failures (up to LUN reset) without stopping the host. Obviously, if you have to escalate to device reset, then you need to start the eh thread.

James
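Editorial sketch: the "running abort" idea described above is a small per-command state machine driven by the timeout handler. The first timeout sends an abort and re-arms the timer; a second timeout with the abort still outstanding escalates; an abort that returns in time lets the command be resent. The code below is a userspace model of that flow and not the libsas or midlayer implementation. All names (`enum timer_action`, `enum cmd_state`, `model_timed_out()`, `model_abort_done()`) are hypothetical; `TIMER_RESET` only stands in for the effect of returning BLK_EH_RESET_TIMER.

```c
#include <assert.h>
#include <stdbool.h>

/* What the timeout handler asks the timer core to do next. */
enum timer_action {
    TIMER_RESET,     /* re-arm the timer, keep the host running */
    TIMER_ESCALATE   /* hand over to heavier recovery (the eh thread) */
};

enum cmd_state {
    CMD_RUNNING,     /* command outstanding, no recovery in progress */
    CMD_ABORT_SENT,  /* abort issued, waiting for it to return */
    CMD_ESCALATED    /* abort did not return in time */
};

struct cmd_model {
    enum cmd_state state;
    int aborts_sent;  /* counts aborts issued, for illustration only */
};

/* Called each time the command's timer fires. */
enum timer_action model_timed_out(struct cmd_model *c)
{
    assert(c);
    if (c->state == CMD_RUNNING) {
        c->aborts_sent++;          /* stand-in for sending the abort TMF */
        c->state = CMD_ABORT_SENT;
        return TIMER_RESET;
    }
    /* Timer fired again and the abort has not returned: escalate. */
    c->state = CMD_ESCALATED;
    return TIMER_ESCALATE;
}

/* Called when the abort returns. Returns true if the caller should
 * resend the original command, false if recovery already escalated. */
bool model_abort_done(struct cmd_model *c)
{
    assert(c);
    if (c->state != CMD_ABORT_SENT)
        return false;
    c->state = CMD_RUNNING;
    return true;
}
```

Only the timed-out command changes state in this model; nothing here blocks queueing to other devices, which is the property the running abort was meant to provide.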
Re: SCSI error handling -- one error blocks the whole SCSI host
Roland,

I agree, and am already working around that limitation.

-- james s

On 5/23/2013 2:14 PM, Roland Dreier wrote:
> At LSF this year, we had a discussion about error handling and in particular the problem that SCSI midlayer error handling waits for the entire SCSI host (HBA) to quiesce before it starts to abort commands etc. James made the suggestion that FC should handle things the way SAS does, because SAS has a strategy handler that does things the right way.
>
> However, now that I finally sit down and look at the code, I don't see how this is the case. It seems inherent in the way that scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in particular the strategy handler can't even be called until host_failed == host_busy; we don't bump host_failed without SHOST_RECOVERY set, which stops queueing commands to any devices attached to the whole HBA).
>
> James, am I understanding your suggestion properly? If so can you explain what you meant about the libsas code -- I see that it has its own strategy handler but as I said before we've already stopped every device attached to the HBA before we ever get there.
>
> To recapitulate the problem here, we might have a whole fabric attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50 devices. Then a single LUN goes wonky and all the IO stops while we try to recover that single device, which might take minutes.
>
> I know this has been discussed before, but can we find a way forward here? Is there some way we can start with per-device error recovery and avoid disrupting IO that we can see is working fine?
>
> Thanks,
> Roland
SCSI error handling -- one error blocks the whole SCSI host
At LSF this year, we had a discussion about error handling and in particular the problem that SCSI midlayer error handling waits for the entire SCSI host (HBA) to quiesce before it starts to abort commands etc. James made the suggestion that FC should handle things the way SAS does, because SAS has a strategy handler that does things the right way.

However, now that I finally sit down and look at the code, I don't see how this is the case. It seems inherent in the way that scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in particular the strategy handler can't even be called until host_failed == host_busy; we don't bump host_failed without SHOST_RECOVERY set, which stops queueing commands to any devices attached to the whole HBA).

James, am I understanding your suggestion properly? If so can you explain what you meant about the libsas code -- I see that it has its own strategy handler but as I said before we've already stopped every device attached to the HBA before we ever get there.

To recapitulate the problem here, we might have a whole fabric attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50 devices. Then a single LUN goes wonky and all the IO stops while we try to recover that single device, which might take minutes.

I know this has been discussed before, but can we find a way forward here? Is there some way we can start with per-device error recovery and avoid disrupting IO that we can see is working fine?

Thanks,
Roland
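Editorial sketch: the condition described in the message above (the strategy handler cannot run until host_failed == host_busy, and SHOST_RECOVERY stops queueing for the whole host) can be modelled with two counters and a flag. This is a simplified userspace model and not the kernel's `struct Scsi_Host` or its locking. The field names are borrowed from the message for readability, while `struct host_model` and every `model_*` function are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-host (per-HBA) state. */
struct host_model {
    int host_busy;     /* commands currently outstanding on the host */
    int host_failed;   /* commands handed over to error handling */
    bool in_recovery;  /* models SHOST_RECOVERY being set */
};

/* A command timed out and is added to the error-handling list. */
void model_eh_scmd_add(struct host_model *h)
{
    h->in_recovery = true;  /* stops new commands to every device */
    h->host_failed++;
}

/* A healthy in-flight command completes normally. */
void model_complete(struct host_model *h)
{
    assert(h->host_busy > h->host_failed);
    h->host_busy--;
}

/* May a new command be queued to any device on this host? */
bool model_can_queue(const struct host_model *h)
{
    return !h->in_recovery;
}

/* The strategy handler runs only once every outstanding command has
 * either completed or failed, i.e. the whole host is quiesced. */
bool model_eh_can_run(const struct host_model *h)
{
    return h->host_failed > 0 && h->host_failed == h->host_busy;
}
```

In this model a single failed command out of three outstanding blocks all new queueing at once, yet recovery cannot start until the other two commands have drained. That is the whole-host stall the message describes.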
Re: SCSI error handling -- one error blocks the whole SCSI host
> James, am I understanding your suggestion properly? If so can you explain what you meant about the libsas code -- I see that it has its own strategy handler but as I said before we've already stopped every device attached to the HBA before we ever get there.
>
> To recapitulate the problem here, we might have a whole fabric attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50 devices. Then a single LUN goes wonky and all the IO stops while we try to recover that single device, which might take minutes.

I'm not James, but from my experience with pm8001 and libsas, your understanding is right: when an error happens on one LUN, the SCSI core does hold the whole SCSI host.

I think Hannes posted a good proposal some weeks ago. It looks reasonable, but I don't know what its status is now.

Regards,
Jack Wang