Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

Knight, Frederick Fri, 29 Apr 2016 00:02:23 -0700

There are multiple possible situations being intermixed in this discussion.  
First, I assume you're talking only about random access devices (if you try 
transport level error recovery on a sequential access device - tape or SMR 
disk - there are lots of additional complexities).

Failures can occur at multiple places:
a) Transport layer failures that the transport layer is able to detect quickly;
b) SCSI device layer failures that the transport layer never even knows about.

For (a) there are two competing goals.  If a port drops off the fabric and 
comes back again, you should be able to just recover and continue.  But how 
long do you wait during that drop?  Some devices use this technique to "move" a 
WWPN from one place to another.  The port drops from the fabric, and a short 
time later, shows up again (the WWPN moves from one physical port to a 
different physical port). There are FC driver layer timers that define the 
length of time allowed for this operation.  The goal is fast failover, but not 
too fast - because too fast will break this kind of "transparent failover".  
This timer also allows for the "OH crap, I pulled the wrong cable - put it back 
in; quick" kind of stupid user bug.
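
As a rough illustration of that trade-off, here is a minimal sketch of how an
outage could be classified against two timers.  It assumes the Linux FC
transport class timers fast_io_fail_tmo and dev_loss_tmo (the second one is not
named in this thread), with both values already read as seconds.  The names
classify_port_drop and Outcome are made up for the example; this is a model of
the policy, not driver code.

```python
from enum import Enum


class Outcome(Enum):
    RESUMED = "resumed"          # port came back in time; I/O continues on the same path
    FAILED_OVER = "failed_over"  # I/O was failed upward so multipath could use another path
    DEVICE_LOST = "device_lost"  # port stayed away too long and is treated as gone


def classify_port_drop(outage_s, fast_io_fail_tmo_s, dev_loss_tmo_s):
    """Classify a port outage of outage_s seconds (hypothetical helper).

    Assumption: fast_io_fail_tmo_s <= dev_loss_tmo_s, both in seconds.
    """
    if not 0 <= fast_io_fail_tmo_s <= dev_loss_tmo_s:
        raise ValueError("expected 0 <= fast_io_fail_tmo <= dev_loss_tmo")
    if outage_s < fast_io_fail_tmo_s:
        # Short drop: WWPN move or a cable put back quickly.
        return Outcome.RESUMED
    if outage_s < dev_loss_tmo_s:
        # Long enough that waiting any longer would stall the application.
        return Outcome.FAILED_OVER
    return Outcome.DEVICE_LOST
```

With a small fast_io_fail_tmo, failover is quick but a WWPN move that takes a
little longer than the timer is no longer transparent; with a large one the
move is transparent but a real failure stalls I/O for that long.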

For (b) the transport never has a failure.  A LUN (or a group of LUNs) has an 
ALUA transition from one set of ports to a different set of ports.  Some of the 
LUNs on the port continue to work just fine, but others enter ALUA TRANSITION 
state so they can "move" to a different part of the hardware.  After the move 
completes, you now have different sets of optimized and non-optimized paths (or 
possibly standby, or unavailable).  The transport will never even know this 
happened.  This kind of "failure" is handled by the SCSI layer drivers.
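
A minimal sketch of the path selection this implies, assuming a per-path view
of the ALUA access state for one LUN.  The numeric values follow the SPC
asymmetric access state encoding; the path names and the helpers usable_paths
and should_retry_later are hypothetical, and the policy (prefer optimized,
fall back to non-optimized, wait while transitioning) is a simplification, not
what any particular driver does.

```python
from enum import Enum


class AluaState(Enum):
    ACTIVE_OPTIMIZED = 0x0
    ACTIVE_NON_OPTIMIZED = 0x1
    STANDBY = 0x2
    UNAVAILABLE = 0x3
    TRANSITIONING = 0xF


def usable_paths(path_states):
    """Return the preferred paths for I/O, sorted by name (hypothetical helper)."""
    for wanted in (AluaState.ACTIVE_OPTIMIZED, AluaState.ACTIVE_NON_OPTIMIZED):
        chosen = sorted(p for p, s in path_states.items() if s is wanted)
        if chosen:
            return chosen
    return []


def should_retry_later(path_states):
    """True when no path is usable now but a transition is still in progress."""
    return not usable_paths(path_states) and any(
        s is AluaState.TRANSITIONING for s in path_states.values()
    )


# Example: a LUN "moves" from controller A to controller B.
before = {"A1": AluaState.ACTIVE_OPTIMIZED, "B1": AluaState.ACTIVE_NON_OPTIMIZED}
during = {"A1": AluaState.TRANSITIONING, "B1": AluaState.TRANSITIONING}
after = {"A1": AluaState.ACTIVE_NON_OPTIMIZED, "B1": AluaState.ACTIVE_OPTIMIZED}
```

In this example the transport sees both ports up the whole time; only the
reported access state of the LUN changes, so the preferred path flips from A1
to B1 without any transport event.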

There are other cases too, but these are the most common.

        Fred

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Bart Van Assche
Sent: Thursday, April 28, 2016 11:54 AM
To: James Bottomley; Mike Snitzer
Cc: [email protected]; [email protected]; device-mapper 
development; linux-scsi
Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 08:40 AM, James Bottomley wrote:
> Well, the entire room, that's vendors, users and implementors
> complained that path failover takes far too long.  I think in their
> minds this is enough substance to go on.

The only complaints I heard about path failover taking too long came 
from people working on FC drivers. Aren't SCSI transport layer 
implementations expected to fail I/O after fast_io_fail_tmo has expired 
instead of waiting until the SCSI error handler has finished? If so, why 
is it considered an issue that error handling for the FC protocol can 
take very long (hours)?

Thanks,

Bart.
