On Aug 20, 12:46 am, Mike Christie <[EMAIL PROTECTED]> wrote:
> v42bis wrote:
> > I have changed the timeouts in iscsid.conf as directed in the iSCSI
> > Root section 8.2 of the README so that in case the open-iscsi
> > initiator loses connection or has other communications problems with
> > my OpenSolaris target then the open-iscsi initiator will wait for up
> > to 24 hours before it fails to the SCSI layer:
>
> > node.session.timeo.replacement_timeout = 86400
> > node.conn[0].timeo.login_timeout = 15
> > node.conn[0].timeo.logout_timeout = 15
> > node.conn[0].timeo.noop_out_interval = 0
> > node.conn[0].timeo.noop_out_timeout = 0
>
> > My OpenSolaris target recently core dumped and came back online in
> > about 5 minutes. By that time, all of my ext3 partitions mounted over
> > iscsi had aborted their journals. Shouldn't iscsi wait for 24 hours
> > before I see any failures on my SCSI layer affecting my ext3
> > partitions?
>
> It should have. Do you have the logs? Do you see something about the
> replacement or recovery timeout timing out? It would have the correct
> 86400 value, but when you look at the log it would say that it failed a
> lot quicker, like the 5 minutes you mention. If this happens you may be
> hitting a bug where the kernel cannot support long timeouts, and
> basically what is happening is the kernel's timer is rolling over and
> not catching itself right, or maybe we are not supposed to be setting
> it that high. We are still investigating to see who is at fault.


Thanks for the reply, Mike.

The first iscsi connection failed about 1m13s after my iscsi target
went down. (The timestamps that follow are synced from the same ntp
master, though clock skew may account for a few seconds' difference.
1m45sec seems very conspicuous: a multiple of the default 15sec
timers?) The target went down at Aug 19 13:33:33.

From /var/log/messages on one of my open-iscsi clients, which had two
sessions active and ext3 filesystems mounted from each at the time of
the target failure:

Aug 19 13:34:46 ak1-vz2 kernel:  connection2:0: iscsi: detected conn error (1011)
Aug 19 13:35:47 ak1-vz2 kernel:  connection1:0: iscsi: detected conn error (1011)
Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: SCSI error: return code = 0x00020000
Aug 19 13:36:38 ak1-vz2 kernel: end_request: I/O error, dev sdc, sector 4063233
Aug 19 13:36:38 ak1-vz2 kernel: lost page write due to I/O error on sdc1
Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: SCSI error: return code = 0x00020000
Aug 19 13:36:38 ak1-vz2 kernel: end_request: I/O error, dev sdc, sector 4157905
Aug 19 13:36:38 ak1-vz2 kernel: lost page write due to I/O error on sdc1
Aug 19 13:36:39 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy 0 host_failed 0
Aug 19 13:36:39 ak1-vz2 kernel: lost page write due to I/O error on sdc1
Aug 19 13:36:39 ak1-vz2 last message repeated 2 times
Aug 19 13:36:41 ak1-vz2 kernel: sd 7:0:0:0: scsi: Device offlined - not ready after error recovery
Aug 19 13:36:41 ak1-vz2 kernel: sd 7:0:0:0: scsi: Device offlined - not ready after error recovery
Aug 19 13:36:41 ak1-vz2 kernel: sd 7:0:0:0: SCSI error: return code = 0x00020000
Aug 19 13:36:41 ak1-vz2 kernel: end_request: I/O error, dev sdd, sector 126137
Aug 19 13:36:41 ak1-vz2 kernel: lost page write due to I/O error on sdd1
Aug 19 13:36:41 ak1-vz2 last message repeated 4 times
Aug 19 13:36:41 ak1-vz2 kernel: sd 7:0:0:0: SCSI error: return code = 0x00020000
Aug 19 13:36:41 ak1-vz2 kernel: end_request: I/O error, dev sdd, sector 33214121
Aug 19 13:36:42 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy 0 host_failed 0
Aug 19 13:36:42 ak1-vz2 kernel: sd 7:0:0:0: SCSI error: return code = 0x00010000
Aug 19 13:36:42 ak1-vz2 kernel: end_request: I/O error, dev sdd, sector 33216097
Aug 19 13:36:42 ak1-vz2 kernel: __journal_remove_journal_head: freeing b_committed_data
Aug 19 13:36:45 ak1-vz2 kernel: reading directory #1245317 offset 0
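
To put numbers on the gaps, here is a rough Python sketch that works
out how long after the target failure each event in the log above
occurred. The timestamps are copied from this message; the helper name
`elapsed` and the event labels are just my own shorthand, and it
assumes all timestamps fall in the same year.

```python
from datetime import datetime

# syslog-style timestamp without a year, e.g. "Aug 19 13:33:33"
FMT = "%b %d %H:%M:%S"


def elapsed(start: str, end: str) -> int:
    """Seconds between two syslog-style timestamps (same year assumed)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Time the target went down, as stated in this message.
target_down = "Aug 19 13:33:33"

# First occurrence of each event in the pasted /var/log/messages excerpt.
events = {
    "connection2:0 conn error": "Aug 19 13:34:46",
    "connection1:0 conn error": "Aug 19 13:35:47",
    "sd 6:0:0:0 offlined": "Aug 19 13:36:38",
    "sd 7:0:0:0 offlined": "Aug 19 13:36:41",
}

for name, stamp in events.items():
    secs = elapsed(target_down, stamp)
    print(f"{name}: {secs} s ({secs // 60}m{secs % 60:02d}s)")
```

Every one of these intervals is a few minutes at most, nowhere near the
86400 seconds configured for replacement_timeout.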

I have seen the following logs in the past when my iscsi target
machine fails over to a multipath/bonded NIC within about 30 seconds:

Jul 29 08:20:09 ak1-vz3.aktiom.net iscsid: received iferror -38
Jul 29 08:20:09 ak1-vz3.aktiom.net iscsid: connection8:0 is operational now

The above did not affect normal operation of my open-iscsi initiators.

This is the only debug info I have. I didn't install open-iscsi with
debug enabled.
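
One more thing that might be worth checking is whether the kernel
actually picked up the 86400 value for each session. Below is a small
Python sketch of how that could be read back. It assumes the running
kernel exposes the per-session value as
/sys/class/iscsi_session/session*/recovery_tmo; I have not confirmed
that path on this box, and the function name `read_recovery_timeouts`
is made up for this example.

```python
from pathlib import Path


def read_recovery_timeouts(base="/sys/class/iscsi_session"):
    """Return {session_name: recovery_tmo_seconds} for each session found.

    `base` is an assumed sysfs location; pass another directory if your
    kernel lays things out differently.
    """
    timeouts = {}
    for attr in sorted(Path(base).glob("session*/recovery_tmo")):
        try:
            timeouts[attr.parent.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Attribute unreadable or not a plain integer; skip it.
            continue
    return timeouts


if __name__ == "__main__":
    for session, tmo in read_recovery_timeouts().items():
        print(f"{session}: recovery_tmo = {tmo} s")
```

If this prints something other than 86400 for a session, the iscsid.conf
setting probably never reached that session's node record.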

--
Dave
