On Tuesday, April 21, 2020 at 11:56:23 PM UTC-7, Uli wrote:
>
> >>> The Lee-Man <[email protected]> wrote on 21.04.2020 at 20:44 in message
> <618_1587494664_5E9F3F08_618_445_1_7f583720-8a84-4872-8d1a-5cd284295c22@googlegroups.com>:
> > On Tuesday, April 21, 2020 at 12:31:24 AM UTC-7, Gionatan Danti wrote: 
> >> 
> >> [reposting, as the previous one seems to be lost] 
> >> 
> >> Hi all, 
> >> I have a question regarding udev events when using iscsi disks. 
> >> 
> >> By using "udevadm monitor" I can see that events are generated when I 
> >> login and logout from an iscsi portal/resource, creating/destroying the 
> >> corresponding links under /dev/ 
> >> 
> >> However, I cannot see anything when the remote machine simply 
> >> dies/reboots/disconnects: while "dmesg" shows the iscsi timeout 
> >> expiring, I don't see anything about a removed disk (and the links 
> >> under /dev/ remain unaltered, indeed). At the same time, when the 
> >> remote machine and disk become available again, no reconnection 
> >> events happen. 
> >> 
> > 
> > Because of the design of iSCSI, there is no way for the initiator to 
> > know the server has gone away. The only time an initiator might figure 
> > this out is when it tries to communicate with the target. 
>
> My knowledge of the SCSI stack is quite poor, but I think the last 
> revisions of parallel SCSI (like Ultra 320 (or was it 160?)) had a concept 
> of "domain validation". AFAIK the latter meant measuring the quality of 
> the wires and adjusting the transfer speed accordingly. 
> While SCSI basically assumes "the bus" won't go away magically, a future 
> iSCSI standard might contain regular "bus checks" to trigger recovery 
> actions if the "bus" (network transport connection) seems to be gone. 
>
> > 
> > This assumes we are not using some sort of directory service, like iSNS, 
> > which can send asynchronous notifications. But even then, the iSNS 
> > server would have to somehow know that the target went down. If the 
> > target crashed, that might be difficult to ascertain. 
>
> To be picky: If the target went down (like a classical failing SCSI disk), 
> it could issue some attention message, but when the transport went down, no 
> such message can be received. So I think there's a difference between 
> "target down" (device not present, device fails to respond) and "bus down" 
> (no communication possible any more). In the second case no assumptions can 
> be made about the health of the target device. 
>
> > 
> > So in the absence of some asynchronous notification, the initiator only 
> > knows the target is not responding if it tries to talk to that target. 
> > 
> > Normally iscsid defaults to sending periodic NO-OPs to the target every 
> > 5 seconds. So if the target goes away, the initiator usually notices, 
> > even if no regular I/O is occurring. 
>
> So the target went away, or the bus went down? 
>

The initiator does not know the difference. As you know, there are dozens 
of things (conservatively) that can go wrong, which is why I say the disk 
"goes away". It could be sleeping. It could be dead. The cable could be 
unplugged. The system could be rebooting. The switch could be down. The 
ACLs could have changed (which is how I simulate a target going away). 
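
As a rough illustration of why the initiator cannot tell these cases apart, 
here is a toy Python model of a NOOP-style keepalive. It is only a sketch, 
not open-iscsi code: the class name, the state names, and the 5-second reply 
timeout are assumptions made up for the example; only the 5-second ping 
interval comes from the discussion above. Whatever the real cause, all the 
model ever observes is a missing reply.

```python
from dataclasses import dataclass


@dataclass
class NoopMonitor:
    """Toy model of an initiator's keepalive ping (hypothetical, not open-iscsi)."""

    noop_interval: float = 5.0   # seconds between pings (default quoted above)
    noop_timeout: float = 5.0    # assumed: how long to wait for a reply
    last_ping_sent: float = 0.0
    awaiting_reply: bool = False
    state: str = "LOGGED_IN"

    def tick(self, now: float, reply_received: bool) -> str:
        """Advance the model to time `now` and return the connection state."""
        if self.state != "LOGGED_IN":
            return self.state
        if self.awaiting_reply:
            if reply_received:
                self.awaiting_reply = False
            elif now - self.last_ping_sent >= self.noop_timeout:
                # The only information available is "no answer". Dead target,
                # pulled cable, and changed ACLs all look exactly the same.
                self.state = "FAILED"
                return self.state
        if not self.awaiting_reply and now - self.last_ping_sent >= self.noop_interval:
            # Time for the next ping.
            self.last_ping_sent = now
            self.awaiting_reply = True
        return self.state
```

Calling tick() with increasing timestamps and reply_received=False 
eventually moves the model to "FAILED", and nothing in that transition 
records which of the failure causes actually happened.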

>
> > 
> > But this is where the error recovery gets tricky, because iscsi tries to 
> > handle "lossy" connections. What if the server will be right back? Maybe 
> > it's rebooting? Maybe the cable will be plugged back in? So iscsi keeps 
> > trying to reconnect. As a matter of fact, if you stop iscsid and restart 
> > it, it sees the failed connection and retries it -- forever, by default. 
> > I actually added a configuration parameter called reopen_max that can 
> > limit the number of retries. But there was pushback on changing the 
> > default value from 0, which is "retry forever". 
> > 
> > So what exactly do you think the system should do when a connection 
> > "goes away"? How long does it have to be gone to be considered gone for 
> > good? If the target comes back "later" should it get the same disk name? 
> > Should we retry, and if so how much before we give up? I'm interested in 
> > your views, since it seems like a non-trivial problem to me. 
>
> IMHO a "bus down" is a critical event affecting _all_ devices on that bus, 
> not just a single target. Well, it might be some extra noise if those other 
> targets have no I/O outstanding, but it's better to know that the bus is 
> down before initiating a transfer rather than concluding seconds later that 
> the target seems unreachable for some unknown reason. 
>

There are 3 error recovery levels built into the iSCSI protocol, and I think 
you'd need to change or augment the protocol to change this. They are 
ERL=[0|1|2]. Error recovery level 0 is the default, and the only one 
supported by open-iscsi. That just means we end the connection and 
reconnect. ERL=1 adds digest error handling, and ERL=2 adds connection 
recovery on top of that, i.e. try to recover the failed connection within 
the existing session before disconnecting and reconnecting.
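
To show how the levels stack, here is a small Python sketch. The enum and 
function names are invented for illustration, and the class labels follow 
the error recovery hierarchy in the iSCSI RFC as I recall it (session, 
digest, connection), so treat them as an assumption rather than a quote 
from the spec. The point is only that each level includes the ones below.

```python
from enum import IntEnum
from typing import List


class ErrorRecoveryLevel(IntEnum):
    """Hypothetical names for the three ERL values."""

    SESSION = 0      # drop everything and log in again
    DIGEST = 1       # additionally recover from header/data digest errors
    CONNECTION = 2   # additionally recover a failed connection in the session


def recovery_classes(erl: ErrorRecoveryLevel) -> List[str]:
    """Return every recovery class available at `erl`; levels are cumulative."""
    return [level.name.lower() for level in ErrorRecoveryLevel if level <= erl]
```

With this model, level 0 yields only the "session" class, which matches 
the end-the-connection-and-reconnect behavior of open-iscsi described 
above.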

It is up to the transport (usually TCP/IP) to tell us of transport errors. 
At the open-iscsi level, the transport should either "just work", or it 
should fail and tell us it failed.

But perhaps I'm being redundant and you know all this.

>
> > 
> >> 
> >> I can read here that, years ago, a patch was in progress to give better 
> >> integration with udev when a device disconnects/reconnects. Did the 
> >> patch get merged? Or does the behavior I described above remain the 
> >> expected one? Can it be changed? 
> >> 
> > 
> > So you're saying as soon as a bad connection is detected (perhaps by a 
> > NOOP), the device should go away? 
>
> Maybe the state should be similar to a device being in power-save mode: 
> It's not accessible right now, but should be woken up ASAP. See my earlier 
> comparison to NFS hard-mounts... 
>

I think the current code works well enough when the target goes away for a 
"short" period of time, but again it depends on how it goes away. Not all 
disappearances are equal, though we really can't tell them apart very well. 
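
For what it's worth, the retry policy discussed earlier in the thread can 
be sketched in a few lines of Python. This is a simplified model, not the 
iscsid implementation: the function name and the try_login callback are 
made up, and I am assuming reopen_max counts login attempts, with 0 
meaning "retry forever" as described above.

```python
import itertools
from typing import Callable, Optional


def reconnect(try_login: Callable[[], bool], reopen_max: int = 0) -> Optional[int]:
    """Call try_login until it succeeds or the attempt budget is used up.

    reopen_max == 0 means retry forever; any other value caps the number
    of attempts. Returns the number of the successful attempt, or None if
    we gave up. A real initiator would wait between attempts; this sketch
    does not.
    """
    attempts = itertools.count(1) if reopen_max == 0 else range(1, reopen_max + 1)
    for attempt in attempts:
        if try_login():
            return attempt
    return None
```

With the default of 0 a target that comes back after any delay is picked 
up again, which is why a "short" outage works well, while a target that 
never returns keeps the loop running indefinitely.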

>
> Regards, 
> Ulrich 
>
>
>
