Re: recovering from a ping timeout, gracefully

Michael Christie Mon, 22 Jun 2015 13:51:52 -0700

On Jun 21, 2015, at 8:55 AM, Brian J. Murrell <[email protected]> wrote:


> So, clearly I have some kind of temporary/intermittent issues with my
> network, unfortunately.  :-(  Fortunately they do seem to be
> infrequently intermittent and most of the time things work.  But every
> now and then I will get a spate of this:
> 
> Jun 19 15:08:39 eagle-4.eagle.hpdd.intel.com kernel: connection17:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5082655665, last ping 5082660665, now 5082665665
> Jun 19 15:08:39 eagle-4.eagle.hpdd.intel.com kernel: connection17:0: detected conn error (1011)
> Jun 19 15:08:39 eagle-4.eagle.hpdd.intel.com kernel: connection24:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5082655834, last ping 5082660834, now 5082665834
> Jun 19 15:08:39 eagle-4.eagle.hpdd.intel.com kernel: connection24:0: detected conn error (1011)
> Jun 19 15:08:40 eagle-4 iscsid: Kernel reported iSCSI connection 17:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
> Jun 19 15:08:40 eagle-4 iscsid: Kernel reported iSCSI connection 24:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
> Jun 19 15:08:40 eagle-4.eagle.hpdd.intel.com kernel: connection23:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5082656608, last ping 5082661608, now 5082666608
> Jun 19 15:08:40 eagle-4.eagle.hpdd.intel.com kernel: connection23:0: detected conn error (1011)
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection19:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5082657189, last ping 5082662189, now 5082667189
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection19:0: detected conn error (1011)
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection21:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5082657235, last ping 5082662235, now 5082667235
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection21:0: detected conn error (1011)
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection18:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5082657253, last ping 5082662253, now 5082667253
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection18:0: detected conn error (1011)
> Jun 19 15:08:41 eagle-4 iscsid: Kernel reported iSCSI connection 23:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection22:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5082657666, last ping 5082662666, now 5082667666
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection22:0: detected conn error (1011)
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection20:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5082657674, last ping 5082662674, now 5082667680
> Jun 19 15:08:41 eagle-4.eagle.hpdd.intel.com kernel: connection20:0: detected conn error (1011)
> Jun 19 15:08:42 eagle-4 iscsid: Kernel reported iSCSI connection 19:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
> Jun 19 15:08:42 eagle-4 iscsid: Kernel reported iSCSI connection 21:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
> Jun 19 15:08:42 eagle-4 iscsid: Kernel reported iSCSI connection 18:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
> Jun 19 15:08:42 eagle-4 iscsid: Kernel reported iSCSI connection 22:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
> Jun 19 15:08:42 eagle-4 iscsid: Kernel reported iSCSI connection 20:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
> 
> and then my iSCSI target is offline.  The network will recover very
> shortly thereafter, though, and I can ping the tgtd server, etc.
> 
> What I wonder is, what is the most graceful way to tell the above
> machine that things are repaired and to consider the target back in
> service?  Currently, after the above happens, accessing the target
> returns EIO even after the network recovers and connectivity is
> restored.  I'm assuming that this error state persists until an
> operator tells iSCSI otherwise.  But how does the operator do
> that?

It should be done automatically.

For the initiator/open-iscsi side of things: when you see the above errors,
the initiator drops the connection, then tries to reconnect and log back in
to the target. After those errors you should see messages from iscsid saying
that it cannot reach the target or cannot connect to it (the exact message
depends on the kind of network problem). Once the network issue is resolved,
we automatically log back in and you can access the target/disks again.
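As a worked reading of the "ping timeout" lines quoted above: as I understand the kernel's libiscsi code, the initiator sends an iSCSI NOP-Out "ping" once nothing has been received for `recv timeout` seconds, and fails the connection (the error 1011 that follows) if that ping goes unanswered for `ping timeout` seconds. The three counters are kernel jiffies, so the gaps between them should match those two timeouts. The Python sketch below assumes HZ=1000 (1 jiffy = 1 ms); `decode_ping_timeout` is a hypothetical helper, not part of open-iscsi.

```python
import re

# Matches one kernel "ping timeout" entry in the format quoted in this thread.
PING_RE = re.compile(
    r"(?P<conn>connection\d+:\d+): ping timeout of (?P<ping>\d+) secs expired, "
    r"recv timeout (?P<recv>\d+), last rx (?P<rx>\d+), "
    r"last ping (?P<last_ping>\d+), now (?P<now>\d+)"
)


def decode_ping_timeout(line, hz=1000):
    """Convert the jiffy counters in a ping-timeout line into seconds.

    ASSUMPTION: the counters are jiffies and the kernel runs at HZ=1000;
    pass a different hz if yours differs. Returns None for other lines.
    """
    m = PING_RE.search(line)
    if m is None:
        return None
    rx, ping, now = int(m["rx"]), int(m["last_ping"]), int(m["now"])
    return {
        "conn": m["conn"],
        # Nothing was received for this long before the ping was sent.
        "idle_before_ping_s": (ping - rx) / hz,
        # The ping went unanswered for this long before the connection failed.
        "waited_for_reply_s": (now - ping) / hz,
        "recv_timeout_s": int(m["recv"]),
        "ping_timeout_s": int(m["ping"]),
    }


line = ("connection17:0: ping timeout of 5 secs expired, recv timeout 5, "
        "last rx 5082655665, last ping 5082660665, now 5082665665")
print(decode_ping_timeout(line))
```

For connection17:0 both gaps come out to 5.0 s, matching the 5-second recv and ping timeouts in the message; connection20:0 shows a 5.006 s wait, presumably just timer slack.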

When we reconnect and relogin, you should see a message like:

connection1:0 is operational after recovery (8 attempts)

in /var/log/messages from iscsid.

You should then be able to access the disks again.
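To confirm recovery without reading the log by hand, one could scan /var/log/messages for the two kinds of lines mentioned in this thread: the kernel's "detected conn error" line when a connection fails, and iscsid's "is operational after recovery" line when it comes back. A minimal Python sketch follows; the function names and the sample lines are made up for illustration, and the regular expressions only cover the message formats quoted here.

```python
import re

# Kernel line logged when a connection fails, e.g.
#   "connection17:0: detected conn error (1011)"
FAIL_RE = re.compile(r"(connection\d+:\d+): detected conn error \((\d+)\)")
# iscsid line logged after a successful reconnect and relogin, e.g.
#   "connection1:0 is operational after recovery (8 attempts)"
RECOVER_RE = re.compile(
    r"(connection\d+:\d+) is operational after recovery \((\d+) attempts\)"
)


def connection_states(lines):
    """Map each connection id to its most recent event.

    Values are ("down", error_code) or ("up", attempts).
    Hypothetical helper, not part of open-iscsi.
    """
    states = {}
    for line in lines:
        m = FAIL_RE.search(line)
        if m:
            states[m.group(1)] = ("down", int(m.group(2)))
            continue
        m = RECOVER_RE.search(line)
        if m:
            states[m.group(1)] = ("up", int(m.group(2)))
    return states


def still_down(lines):
    """Connections whose latest event is a failure with no recovery after it."""
    return sorted(
        conn for conn, (state, _) in connection_states(lines).items()
        if state == "down"
    )


# Made-up sample input; in practice, pass the lines of /var/log/messages.
log = [
    "kernel: connection17:0: detected conn error (1011)",
    "kernel: connection24:0: detected conn error (1011)",
    "iscsid: connection17:0 is operational after recovery (8 attempts)",
]
print(still_down(log))  # ['connection24:0']
```

Only the latest event per connection is kept, so a connection that fails again after recovering is reported as down again. Reading /var/log/messages itself usually requires root.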


-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.