On 04/14/2011 10:32 PM, Ben Greear wrote:
On 04/14/2011 04:24 PM, Mike Christie wrote:
On 04/14/2011 06:08 PM, Ben Greear wrote:
We make an application that calls the iscsi tools to mount
an iscsi server. We then generate file I/O against the
mounted disks to load-test the iscsi servers...

One customer is testing failover in their iscsi server
and it is causing our system to crash.

I am curious if anyone has any ideas about what the problem
might be?


sd 11:0:0:3: [sdk] Result: hostbyte=DID_TRANSPORT_FAILFAST
driverbyte=DRIVER_OK

If you look at more of the logs, do you see a

session recovery timed out after X secs

message before you start to see the messages you posted?

It doesn't seem to be there, but perhaps we just didn't find it.

Here are some additional logs:


Is there any way you can send all of the logs?


connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx
4304582527, last ping 4304587536, now 4304592544
connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx
4304582527, last ping 4304587536, now 4304592671
connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx
4304583024, last ping 4304588032, now 4304593040
connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx
4304583024, last ping 4304588032, now 4304593040

For the test you are performing I would expect these errors. They basically mean we did not get traffic for 5 seconds, so we sent a ping to check the target portal. We did not get a response, so from there we end up dropping the connection and trying to relogin.
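The numbers in those messages are kernel jiffies. Assuming HZ=1000 (one jiffy per millisecond, a common configuration; this is an assumption, the thread does not state it), the deltas for the first connection2:0 message line up with the two 5-second timeouts:

```shell
# Jiffies copied from the first connection2:0 message above; HZ=1000 assumed.
last_rx=4304582527
last_ping=4304587536
now=4304592544

# NOP-Out ping was sent ~5 s after the last received PDU (the recv timeout)
echo "ping sent after: $(( (last_ping - last_rx) / 1000 )) s"
# connection was dropped ~10 s after the last rx (recv timeout + ping timeout)
echo "dropped after:   $(( (now - last_rx) / 1000 )) s"
```

That is, 5 seconds of silence triggers the ping, and 5 more seconds without a ping reply drops the connection.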

In the last chunk of logs from iscsid we see that we relogged in ok. But we do not have the timestamps, so I am not sure when that happened relative to the DID_TRANSPORT_FAILFAST errors in the other mail.

Also, we have another chunk of ping timeouts, which indicates there were two problems (those ping timeouts come after the set above). But we only have the one chunk of messages from iscsid, indicating that it fixed only one of them.
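For reference, the 5-second values in those messages come from the NOP-Out (ping) settings in the node record. A sketch of checking and raising them with iscsiadm (the target name is a placeholder, not from this thread):

```shell
# Illustrative only -- iqn.example:target stands in for the real target name.
# Show the current NOP-Out interval/timeout for the node record:
iscsiadm -m node -T iqn.example:target | grep noop_out

# Raise both from 5 s to 10 s if the network is expected to stall longer:
iscsiadm -m node -T iqn.example:target -o update \
    -n node.conn[0].timeo.noop_out_interval -v 10
iscsiadm -m node -T iqn.example:target -o update \
    -n node.conn[0].timeo.noop_out_timeout -v 10
```

Updates to the node record take effect on the next login, so the session would need to be logged out and back in.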


connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx
4305408823, last ping 4305413824, now 4305418832
connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx
4305408822, last ping 4305413824, now 4305418959
connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx
4305408822, last ping 4305413824, now 4305419086
connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx
4305409171, last ping 4305414176, now 4305419213

The following messages are from iscsid:

Apr 14 15:19:26 net-lanf-3 iscsid: Kernel reported iSCSI connection 2:0
error (1011) state (3)
Apr 14 15:19:26 net-lanf-3 iscsid: Kernel reported iSCSI connection 3:0
error (1011) state (3)
Apr 14 15:19:26 net-lanf-3 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)
Apr 14 15:19:26 net-lanf-3 iscsid: Kernel reported iSCSI connection 4:0
error (1011) state (3)
Apr 14 15:19:56 net-lanf-3 iscsid: connect to 198.18.164.11:3260 failed
(No route to host)
Apr 14 15:19:59 net-lanf-3 iscsid: connection2:0 is operational after
recovery (3 attempts)
Apr 14 15:19:59 net-lanf-3 iscsid: connection3:0 is operational after
recovery (3 attempts)
Apr 14 15:19:59 net-lanf-3 iscsid: connection1:0 is operational after
recovery (3 attempts)
Apr 14 15:19:59 net-lanf-3 iscsid: connection4:0 is operational after
recovery (3 attempts)


The DID_TRANSPORT_FAILFAST means something happened to the connection.
We tried to relogin for node.session.timeo.replacement_timeout seconds.
We couldn't, so we failed the I/O upwards to the SCSI/block/FS layers.
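As a sketch, that timeout can be inspected and changed per node record (again, the target name below is a placeholder):

```shell
# Illustrative only -- iqn.example:target stands in for the real target name.
# Show the current replacement timeout (seconds):
iscsiadm -m node -T iqn.example:target \
    | grep node.session.timeo.replacement_timeout

# Give the surviving controller more time to take over the IP before
# I/O is failed up the stack:
iscsiadm -m node -T iqn.example:target -o update \
    -n node.session.timeo.replacement_timeout -v 180
```

A larger value trades faster error reporting for more tolerance of slow failovers; the right number depends on how long the controller takeover actually takes.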

What do you mean by iscsi server? Is the iscsi server the iscsi target
or does iscsi server mean the server running the iscsi initiator?

And what does failover mean in this context, and how were you testing it?
Some sort of cluster failover across iscsi targets? Failover across
portals on the target? Were you disabling/enabling controllers,
starting/stopping targets? Did you mean to use dm-multipath with iscsi?

I believe they have two 'controllers'. When one goes down, the other
assumes the IP address. Their current testing that causes this crash
is to reboot one controller, then the other a bit later, all the while
attempting to read/write data.


Ok, yeah, we have seen this type of setup before, and we should be able to handle it as long as the other controller assumes the IP and is ready within node.session.timeo.replacement_timeout seconds.

What is the value of

iscsiadm -m node -T yourtarget | grep node.session.timeo.replacement_timeout


--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.
