Chiradeep Vittal wrote:
> Very enlightening, thanks.
> If there is a constant stream of traffic over the iscsi session and
> there is a network failure,
> then the scsi eh timer should fire right?

If you have nops turned off then it will. If you have nops on then it
could fire, but the iscsi layer will prevent the scsi eh timer from
starting the scsi eh.

With nops on we send a nop every noop_out_interval seconds. If we have
some bad timing and a scsi cmd timer fires while a nop is in flight, or
right before we want to send a nop to test the network, then starting
with open-iscsi 2.0-871 and upstream kernel 2.6.30 we reset the scsi cmd
eh timer and send a nop. If the nop times out during that reset cmd
timer period, then we drop the iscsi session and the scsi eh does not
run. If the nop completes ok, then the scsi eh will run when the cmd
timer times out again.

If there is a network problem while the scsi eh is running, then the
scsi eh could fail if we cannot reconnect within replacement_timeout
seconds, and that would lead to the devices going offline as seen in
sysfs below.

> And the disk will then go offline (according to /sys/block/<disk>/
> device/state )?
>
> I think where this is leading to is to use dm-multipath even if there
> is only a single path since dm-multipath
> will constantly test the link.
>
>
> On Sep 11, 7:53 am, Mike Christie <micha...@cs.wisc.edu> wrote:
>> On 09/10/2009 05:23 PM, Chiradeep Vittal wrote:
>>
>>> Thanks. I'll take a look at the netlink interface. Not using multipath
>>> for now, but will do so later.
>>> For basic monitoring of storage network problems, here's what I am
>>> thinking:
>>> 1. If there is a network failure, eventually cat /sys/block/<disk>/
>>> device/state should show "offline" ?
>>> 2. How long will this take?
>>> I know that this is a function of
>>> replacement_timeout, noop_interval, noop_timeout and scsi timeout, but
>>> the relationship is not clear
>>> Let us say
>>> a=session.timeo.noop_out_interval=5
>>> b=session.timeo.noop_out_timeout=5
>>> c=session.timeo.replacement_timeout=120
>>> d=`cat /sys/block/<disk>/device/timeout`=60
>>> The disk should go offline in a maximum of a+b+c+d=190s after a
>>> network failure?
>> It is not really that easy, because if the nop times out the iscsi layer
>> will drop the session and the disk state will not change to offline. The
>> disk state will only change if the scsi command timer fires and the scsi
>> eh runs and fails. In this case the disk state will go to offline.
>>
>> For the nop timeout case and the scsi eh failing case, the iscsi session
>> state will go to failed, so you could check that instead. That value is in
>>
>> /sys/class/iscsi_session/session%SID/state
>>
>>> If the network comes back up, how soon will the disk state go to
>>> 'running' ?
>> When the iscsi session is dropped due to a nop timeout or the scsi eh
>> failing, the initiator will basically poll the network every couple of
>> seconds by trying to reconnect the tcp connection. And so it depends on
>> the type of failure. If the initiator is trying to reconnect the tcp
>> connection when the network comes up, then we could reconnect right
>> away, or if the network layer cannot figure things out the reconnect
>> could time out and then the next try would work, or if the network had
>> given us an error right away when we tried the reconnect then on the
>> next reconnect attempt we would be successful.
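
For the basic monitoring question in this thread, the two sysfs checks mentioned above (the iscsi session state and the disk state) could be polled with something like the following. This is only a minimal sketch, not part of open-iscsi. The function names, the `sysfs_root` parameter and the labels returned by `classify` are made up for illustration, and the case-insensitive comparison against "failed" is an assumption about how the session state is spelled in sysfs.

```python
from pathlib import Path


def read_state(path):
    """Return the stripped contents of a sysfs state file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def session_states(sysfs_root="/sys"):
    """Map each session directory name (e.g. "session1") to its state string.

    Reads <sysfs_root>/class/iscsi_session/session*/state, the file named in
    the thread. sysfs_root is a hypothetical parameter so the function can be
    pointed at a fake tree for testing.
    """
    base = Path(sysfs_root) / "class" / "iscsi_session"
    return {d.name: read_state(d / "state") for d in sorted(base.glob("session*"))}


def disk_state(disk, sysfs_root="/sys"):
    """Return the contents of <sysfs_root>/block/<disk>/device/state, or None."""
    return read_state(Path(sysfs_root) / "block" / disk / "device" / "state")


def classify(session_state, device_state):
    """Turn the two states into a coarse, hypothetical label.

    Follows the behaviour described in the thread: the disk only goes
    "offline" when the scsi eh runs and fails, while the session goes to
    failed for both the nop timeout case and the scsi eh failing case.
    """
    if session_state is None and device_state is None:
        return "unknown"
    if device_state == "offline":
        return "disk-offline"
    # Assumption: compare case-insensitively, since the exact spelling of
    # the failed state in sysfs is not given in the thread.
    if session_state is not None and session_state.lower() == "failed":
        return "session-dropped"
    return "ok"
```

A caller would loop over `session_states()` and call `classify(state, disk_state("sdb"))` for the disk behind each session, where "sdb" is a placeholder name. Mapping a session to its disks is left out here. A "session-dropped" result with the disk still not offline would correspond to the nop timeout case, where the initiator is still retrying the tcp connection.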