Hello open-iscsi people,
I am new to iSCSI, and I am currently running some "reliability" tests.

In particular, I would like to be able to reboot the target machine without the initiators losing data.
Like NFS hard mounts.

If the target goes down:
1) I want the device to be frozen, so that applications get stuck while trying to access it.
2) When the target comes up again, I want the in-flight commands to be replayed to the target so that no data is lost.

I was able to achieve part 1 by increasing replacement_timeout to a high enough value.

However, it seems I cannot achieve part 2, because there are still errors in dmesg. My guess, from what is written in the README, is that this is due to the lost in-flight commands.

These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264: ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117: ext4_put_super:719
They look harmful...
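
To put the two timestamps in context, here is a minimal sketch that converts them to UTC and computes the gap between them. It assumes the numbers printed by ext4 are Unix epoch seconds; the constant and function names are only illustrative.

```python
from datetime import datetime, timezone

# Values copied from the dmesg lines above.
# Assumption: they are Unix epoch seconds.
INITIAL_ERROR = 1292972264
LAST_ERROR = 1292976117


def as_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def elapsed_seconds(first, last):
    """Seconds between the first and the last recorded error."""
    return last - first


print("initial error:", as_utc(INITIAL_ERROR).strftime("%Y-%m-%d %H:%M:%S"))
print("last error:   ", as_utc(LAST_ERROR).strftime("%Y-%m-%d %H:%M:%S"))
print("elapsed (s):  ", elapsed_seconds(INITIAL_ERROR, LAST_ERROR))
```

Under that assumption the 10 errors were recorded over roughly an hour, so they are spread across many stop/start cycles rather than a single one.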


Firstly, I don't understand why open-iscsi does not requeue in-flight commands by itself as soon as it blocks the device because the connection was lost. It seems the braindead obvious solution to me. Then, if replacement_timeout expires, all commands (in-flight and queued) should be failed together to the upper layer. I don't understand why they should be treated differently.


Secondly, I read in the docs that SCSI commands are retried 5 times.
OK, good! But then I don't understand why ext4 still sees data loss. I was running cycles of
...
stop target service
wait 15 secs
start target service
wait 15 secs
...
(meanwhile, the initiator is untarring tens of thousands of files from a kernel tarball in an endless loop)
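
The cycle above can be sketched as a small script. This is only an illustration of the test procedure: the function name is made up, and the stop/start commands are passed in by the caller because the exact service commands depend on how the target is installed.

```python
import subprocess
import time


def cycle_target(stop_cmd, start_cmd, pause_s=15, cycles=1,
                 run=subprocess.check_call, sleep=time.sleep):
    """Stop and restart the target `cycles` times, pausing after each step.

    `stop_cmd` and `start_cmd` are argument lists for the commands that
    stop and start the target service. `run` and `sleep` are injectable
    so the loop can be exercised without touching a real service.
    """
    for _ in range(cycles):
        run(stop_cmd)    # stop target service
        sleep(pause_s)   # wait 15 secs
        run(start_cmd)   # start target service
        sleep(pause_s)   # wait 15 secs


# Example call; the service name below is hypothetical:
# cycle_target(["service", "my-target", "stop"],
#              ["service", "my-target", "start"], cycles=100)
```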


I cannot believe the SCSI commands could really fail 5 times in just 15 seconds: that would imply a 3-second timeout, which is too low...

Also, when the SCSI layer resubmits the command (second submission), the device is blocked, so the command should get stuck in the queue and stay there until the connection is recovered (assuming a high enough replacement_timeout). So the commands should not fail more than once. Then why the errors?

I have even increased /sys/block/sdX/device/timeout to a very high value. That is the SCSI command timeout, isn't it?
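
To double-check the value that is actually in effect, a minimal sketch like the following reads it back from sysfs. The helper name is illustrative, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory; the file is assumed to contain a single integer number of seconds.

```python
from pathlib import Path


def read_scsi_timeout(device, sysfs_root="/sys/block"):
    """Return the timeout in seconds from <sysfs_root>/<device>/device/timeout."""
    path = Path(sysfs_root) / device / "device" / "timeout"
    return int(path.read_text().strip())


# Example call on the initiator ("sdd" is the disk from the dmesg output):
# print(read_scsi_timeout("sdd"))
```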


I have also increased the following open-iscsi timeouts:
    node.session.timeo.replacement_timeout = 480
    node.session.err_timeo.abort_timeout = 60
    node.session.err_timeo.lu_reset_timeout = 80
    node.session.err_timeo.host_reset_timeout = 240

but apparently nothing helps.
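
As a sanity check on the numbers, here is a small sketch that parses the settings listed above and compares replacement_timeout with the 15-second outage used in my test. The parser only understands simple `key = value` lines, and the function names are illustrative.

```python
# Settings copied from the list above.
SETTINGS = """\
node.session.timeo.replacement_timeout = 480
node.session.err_timeo.abort_timeout = 60
node.session.err_timeo.lu_reset_timeout = 80
node.session.err_timeo.host_reset_timeout = 240
"""


def parse_settings(text):
    """Parse 'key = value' lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def outlasts_outage(settings, outage_s):
    """True if replacement_timeout is longer than the outage, in seconds."""
    return int(settings["node.session.timeo.replacement_timeout"]) > outage_s


settings = parse_settings(SETTINGS)
print(outlasts_outage(settings, 15))
```

With these values the replacement timeout is far longer than each outage, so it should not be expiring during the test.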

Please help

Thank you
PS: SCST target
