Greetings Mike, Hannes, and all;

As Jerome and I have been pushing VHACS (please see towards
production, we have begun to run into a particular issue during
the 'vhacs cluster -I' (i.e. cluster initialization) routine when we have a
bunch of VHACS server and client clouds active and need to bring them
all down.  Before explaining the case, here is a little background:

DRBD_TARGET: LIO-Target + DRBD export Cluster RA (Server side VHACS)
ISCSI_MOUNT: Open/iSCSI + Ext3 Mount Cluster RA (Client side VHACS)

Here is the scenario using 'node.session.timeo.replacement_timeout =
120':

We have been experiencing a case where sometimes the DRBD_TARGET RA for
a particular VHACS cloud gets shut down BEFORE the ISCSI_MOUNT RA.  This
of course means that all of the outstanding I/Os from Open/iSCSI with
regard to the mounted filesystem from the DRBD_TARGET that has gone away
need to be failed back to SCSI with a DID_ERROR status.
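
To make the ordering concrete, here is a minimal Python sketch of the stop
order we would expect for one cloud.  This is not VHACS or cluster manager
code: the two resource names come from the background above, while the
dependency table and the stop_order() helper are made up purely for
illustration.

```python
# Hypothetical dependency table: each resource lists what it needs underneath it.
DEPENDS_ON = {
    "ISCSI_MOUNT": ["DRBD_TARGET"],  # the client mount needs the exported target
    "DRBD_TARGET": [],
}


def stop_order(depends_on):
    """Return a safe stop order: dependents come before their dependencies."""
    start_order = []
    visited = set()

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        for dep in depends_on[name]:
            visit(dep)
        start_order.append(name)  # dependencies are appended first (start order)

    for name in sorted(depends_on):
        visit(name)
    # Stopping is simply starting in reverse.
    return list(reversed(start_order))


print(stop_order(DEPENDS_ON))
```

In this sketch ISCSI_MOUNT always comes out ahead of DRBD_TARGET; the case
described above is the one where the real shutdown does not follow that
order.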

The problem is that the failure of the outstanding I/Os does not seem to
be occurring in all cases.  In particular, I believe an iscsiadm --logout
is getting issued, and said logout request is failing/timing out because
DRBD_TARGET has been released.  It is at this point that umount for the
ext3 mount and/or sync hangs indefinitely.  When the problem occurs, it
looks like this from the kernel ring buffer:

iscsi_deallocate_extra_thread_sets:285: ***OPS*** Stopped 1 thread set(s) (2 total threads).
iscsi_deallocate_extra_thread_sets:285: ***OPS*** Stopped 2 thread set(s) (4 total threads).
session10: iscsi: session recovery timed out after 120 secs
sd 51:0:0:0: scsi: Device offlined - not ready after error recovery
sd 51:0:0:0: [sdg] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdg, sector 0
Buffer I/O error on device sdg, logical block 0
lost page write due to I/O error on sdg

I should mention that we are not doing any I/O to said iSCSI LUN via
Open/iSCSI other than the filesystem metadata for ext3 umount and
SYNCHRONIZE_CACHE CDB during struct scsi_device deregistration.  From
experience with Core-iSCSI, I know the logout path is tricky wrt
exceptions (I spent months on it to handle all cases with Immediate and
Non Immediate Logout, as well as doing logouts on the fly from the same
connection in MC/S and different connections in MC/S :-)

So the question is:

I) When an ISCSI_INIT_LOGOUT_REQ is not answered with an
ISCSI_TARGET_LOGOUT_RSP and replacement_timeout fires, are all
outstanding I/Os for that particular session being completed with a
non-recoverable exception..?  Has anyone ever run into this case and/or
tested it..?
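
To spell out the behaviour I am asking about, here is a toy model in
Python.  It is not Open/iSCSI code: the ToySession class, its methods and
the integer task tags are all invented for illustration, and it simply
assumes the recovery window starts at the moment the logout goes
unanswered.

```python
DID_ERROR = "DID_ERROR"  # label only; stands in for the SCSI host byte


class ToySession:
    """Toy stand-in for an initiator session (hypothetical, not Open/iSCSI)."""

    def __init__(self, replacement_timeout):
        self.replacement_timeout = replacement_timeout  # seconds
        self.outstanding = {}       # task tag -> description
        self.completed = {}         # task tag -> completion status
        self.recovery_started = None

    def queue(self, tag, description):
        self.outstanding[tag] = description

    def start_recovery(self, now):
        # Assumption: the logout request got no response, so the session
        # enters recovery and the replacement timer starts counting here.
        self.recovery_started = now

    def tick(self, now):
        """Fail everything still outstanding once the timer has expired."""
        if self.recovery_started is None:
            return
        if now - self.recovery_started >= self.replacement_timeout:
            for tag in list(self.outstanding):
                del self.outstanding[tag]
                self.completed[tag] = DID_ERROR


session = ToySession(replacement_timeout=120)
session.queue(1, "ext3 metadata write")
session.queue(2, "SYNCHRONIZE_CACHE")
session.start_recovery(now=0)
session.tick(now=119)            # timer has not fired, nothing completes
pending_before = len(session.outstanding)
session.tick(now=120)            # timer fires, everything must complete
print(pending_before, session.completed, session.outstanding)
```

The point of the sketch is the last tick: once the timer fires, nothing may
be left in the outstanding table, otherwise umount/sync have something to
wait on forever.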

Not being exquisitely familiar with the Open/iSCSI source code, if either of
you gents could have a quick look and see if this is indeed broken and
hopefully provide a simple fix (I will take care of drinks at the Kernel
Summit :-), or if not could point me in the right direction, it would be
much appreciated.

--nab






