On Tuesday, 21 April 2020 at 20:44:22 UTC+2, The Lee-Man wrote:
>
> Because of the design of iSCSI, there is no way for the initiator to know
> the server has gone away. The only time an initiator might figure this out
> is when it tries to communicate with the target.
>
> This assumes we are not using some sort of directory service, like iSNS,
> which can send asynchronous notifications. But even then, the iSNS server
> would have to somehow know that the target went down. If the target
> crashed, that might be difficult to ascertain.
>
> So in the absence of some asynchronous notification, the initiator only
> knows the target is not responding if it tries to talk to that target.
>
> Normally iscsid defaults to sending periodic NO-OPs to the target every 5
> seconds. So if the target goes away, the initiator usually notices, even if
> no regular I/O is occurring.
>
True.

> But this is where the error recovery gets tricky, because iscsi tries to
> handle "lossy" connections. What if the server will be right back? Maybe
> it's rebooting? Maybe the cable will be plugged back in? So iscsi keeps
> trying to reconnect. As a matter of fact, if you stop iscsid and restart
> it, it sees the failed connection and retries it -- forever, by default. I
> actually added a configuration parameter called reopen_max, that can limit
> the number of retries. But there was pushback on changing the default value
> from 0, which is "retry forever".
>
> So what exactly do you think the system should do when a connection "goes
> away"? How long does it have to be gone to be considered gone for good? If
> the target comes back "later" should it get the same disc name? Should we
> retry, and if so how much before we give up? I'm interested in your views,
> since it seems like a non-trivial problem to me.
>

Well, for short disconnections the retry approach is surely the better one.
But I naively assumed that a longer disconnection, as defined by the
node.session.timeo.replacement_timeout parameter, would tear down the device
with a corresponding udev event. Udev should have no problem assigning the
device a sensible persistent name, right?

> So you're saying as soon as a bad connection is detected (perhaps by a
> NOOP), the device should go away?
>

I would say that the device should go away not at the first failed NOOP, but
when replacement_timeout (or another sensible timeout) expires.

This opens the door to another question: from the iscsid.conf
<https://github.com/open-iscsi/open-iscsi/blob/master/etc/iscsid.conf#L99>
and README
<https://github.com/open-iscsi/open-iscsi/blob/master/README#L1476>
files I (wrongly?)
understand that replacement_timeout comes into play only when the SCSI EH is
running, while in other cases different timeouts, such as
node.session.err_timeo.lu_reset_timeout and
node.session.err_timeo.tgt_reset_timeout, should affect the (dis)connection.
However, in all my tests I only saw replacement_timeout being honored, yet I
never caught a single running instance of SCSI EH via the suggested command,
iscsiadm -m session -P 3.

What am I missing?

Thanks.
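
Since SCSI EH may run only briefly, checking by hand can easily miss it. Below is a minimal Python sketch of how one could scan the output of iscsiadm -m session -P 3 for hosts in recovery. It assumes the report contains lines of the form "Host Number: X State: Recovery", as the README describes; the function names (host_states, hosts_in_recovery, snapshot) are hypothetical helpers and not part of open-iscsi.

```python
import re
import subprocess

# Assumed line shape, based on the README's description of the
# "iscsiadm -m session -P 3" report: "Host Number: 3 State: running".
HOST_LINE = re.compile(r"Host Number:\s*(\d+)\s+State:\s*(\S+)")


def host_states(report: str) -> dict[int, str]:
    """Map each SCSI host number found in the report to its lowercased state."""
    return {int(num): state.lower() for num, state in HOST_LINE.findall(report)}


def hosts_in_recovery(report: str) -> list[int]:
    """Return host numbers whose state reads 'recovery' (SCSI EH running)."""
    return sorted(n for n, s in host_states(report).items() if s == "recovery")


def snapshot() -> list[int]:
    """Run iscsiadm once and list hosts in recovery.

    Needs open-iscsi installed and usually root privileges.
    """
    out = subprocess.run(
        ["iscsiadm", "-m", "session", "-P", "3"],
        capture_output=True, text=True, check=True,
    ).stdout
    return hosts_in_recovery(out)
```

Calling snapshot() in a loop with a short sleep while the target is unplugged would show whether any host ever enters the recovery state during the test.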
