Hello open-iscsi people
I am new to iSCSI, and I am currently running some "reliability" tests.
In particular, I would like to be able to reboot the target machine
without the initiators losing data, much like NFS hard mounts.
If the target goes down:
1) I want the device to be frozen so that applications get stuck while
trying to access the device
2) and when the target comes up again, I want the in-flight commands to
be replayed to the target so that no data is lost.
I was able to achieve part 1 by increasing replacement_timeout to a
high enough value.
However, it seems I cannot achieve part 2, because errors still show up
in dmesg. My guess, from what is written in the README, is that they
are caused by the lost in-flight commands.
These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264: ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117: ext4_put_super:719
They look harmful...
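To put the two numbers in those messages into context, here is a small
sketch that treats them as Unix epoch seconds. That interpretation is my
assumption about how ext4 records its error times; the values are copied
from the dmesg lines above, and the function names are just illustrative.

```python
from datetime import datetime, timezone

# Values copied from the dmesg output above. Treating them as Unix epoch
# seconds is an assumption, not something the log itself states.
INITIAL_ERROR = 1292972264
LAST_ERROR = 1292976117


def to_utc(epoch_seconds: int) -> str:
    """Format epoch seconds as an ISO 8601 timestamp in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def elapsed_seconds(first: int, last: int) -> int:
    """Return how many seconds separate two epoch timestamps."""
    return last - first


print("initial error:", to_utc(INITIAL_ERROR))
print("last error:   ", to_utc(LAST_ERROR))
print("seconds apart:", elapsed_seconds(INITIAL_ERROR, LAST_ERROR))  # 3853
```

If the assumption holds, the first and last recorded errors are a bit
over an hour apart, so they span many stop/start cycles rather than one.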
First, I don't understand why open-iscsi does not requeue in-flight
commands by itself as soon as it blocks the device because the
connection was lost. That seems like the braindead obvious solution to
me. Then, if replacement_timeout expires, all commands (in-flight and
queued) should be failed together to the layer above. I don't
understand why they should be treated differently.
Second, I read in the docs that SCSI commands are retried 5 times.
OK, good! But then I don't understand why ext4 still sees data loss. I
was running cycles of
...
stop target service
wait 15 secs
start target service
wait 15 secs
...
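For anyone who wants to reproduce the cycle above, this is roughly what
it looks like as a script. It is only a sketch: the "service" command
and the service name passed in are assumptions about the target host,
and cycle_target with its run/sleep hooks is a hypothetical helper, not
part of any tool.

```python
import subprocess
import time
from typing import Callable, List


def cycle_target(
    service: str,
    cycles: int,
    pause: float = 15.0,
    run: Callable[[List[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
    sleep: Callable[[float], object] = time.sleep,
) -> None:
    """Stop and start the target service `cycles` times.

    `service` is whatever name the target's init script uses (an
    assumption here). `run` and `sleep` are injectable so the loop can
    be dry-run without touching a real service.
    """
    for _ in range(cycles):
        run(["service", service, "stop"])
        sleep(pause)  # target is down; initiator I/O should block
        run(["service", service, "start"])
        sleep(pause)  # target is up; initiator should reconnect


# Dry run: print the commands instead of executing them.
cycle_target("scst", cycles=1, run=print, sleep=lambda s: print(f"wait {s} secs"))
```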
(meanwhile, the initiator is untarring tens of thousands of files from
a kernel tarball in an endless loop)
With only 15 seconds of downtime, I cannot believe the SCSI commands
could really fail 5 times: that would imply a 3-second timeout per
attempt, which is too low...
Also, when the SCSI layer resubmits a command (second submission), the
device is blocked, so the command should get stuck in the queue and
stay there until the connection is recovered (assuming a high enough
replacement_timeout). So commands should not fail more than once.
Then why the errors?
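The arithmetic behind that reasoning, written out. It assumes the
simplest possible model, where the 5 attempts run back to back and all
have to fail inside a single 15-second outage; the function name is
only illustrative.

```python
def per_attempt_budget(outage_seconds: float, attempts: int) -> float:
    """Time available per attempt if all attempts must fail within one outage."""
    return outage_seconds / attempts


# 15 seconds of downtime spread over 5 attempts
print(per_attempt_budget(15, 5))  # 3.0
```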
I have even increased /sys/block/sdX/device/timeout to a very high
value. That is the SCSI command timeout, isn't it?
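For reference, this is a sketch of how that sysfs value can be read and
changed from a script. The path layout is the one quoted above; the
helper names and the sysfs_root parameter are hypothetical (the latter
is only there so the code can be tried against a scratch directory), and
writing to the real file needs root.

```python
from pathlib import Path


def timeout_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Build the path to a block device's timeout attribute."""
    return Path(sysfs_root) / "block" / device / "device" / "timeout"


def read_timeout(device: str, sysfs_root: str = "/sys") -> int:
    """Return the current timeout value, in seconds."""
    return int(timeout_path(device, sysfs_root).read_text().strip())


def write_timeout(device: str, seconds: int, sysfs_root: str = "/sys") -> None:
    """Set a new timeout value, in seconds."""
    timeout_path(device, sysfs_root).write_text(f"{seconds}\n")


# Example (device name and value are placeholders):
#   write_timeout("sdd", 600)
#   print(read_timeout("sdd"))
```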
I have also increased the following open-iscsi timeouts:
node.session.timeo.replacement_timeout = 480
node.session.err_timeo.abort_timeout = 60
node.session.err_timeo.lu_reset_timeout = 80
node.session.err_timeo.host_reset_timeout = 240
but apparently nothing helps.
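To make the values above easier to compare against the 15-second outage
in my test, here is a sketch that parses the four settings as plain
"key = value" lines. The parser is a hypothetical helper written for
this mail, not an open-iscsi API, and it only handles this simple
format.

```python
SETTINGS = """
node.session.timeo.replacement_timeout = 480
node.session.err_timeo.abort_timeout = 60
node.session.err_timeo.lu_reset_timeout = 80
node.session.err_timeo.host_reset_timeout = 240
"""

OUTAGE_SECONDS = 15  # length of each target outage in the test cycle


def parse_settings(text: str) -> dict:
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        settings[key.strip()] = value.strip()
    return settings


timeouts = {key: int(value) for key, value in parse_settings(SETTINGS).items()}

for key, seconds in timeouts.items():
    # Every timeout should comfortably exceed the outage window.
    print(f"{key}: {seconds}s, longer than outage: {seconds > OUTAGE_SECONDS}")
```

All four values are well above the 15 seconds the target stays down,
which is why I expected the commands to survive each cycle.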
Please help
Thank you
PS: SCST target