NFS hard semantics wanted: how to?

torn5 Wed, 22 Dec 2010 09:21:09 -0800

Hello open-iscsi people
I am new to iSCSI, and I am currently doing some "reliability" tests.

In particular I would like to be able to reboot the target machine without the initiators losing data.

Like NFS hard mounts.

If the target goes down:

1) I want the device to be frozen so that applications get stuck while trying to access the device

2) and when the target comes up again, I want the in-flight commands to be replayed to the target so that no data is lost.

I was able to obtain part 1 by increasing replacement_timeout to a high enough value.

However, it seems I cannot obtain part 2 because there are still errors in the dmesg. I think this is due to the lost in-flight commands (my guess, from what's written in the README).


These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264:ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117: ext4_put_super:719

They look harmful...

Firstly, I don't understand why open-iscsi does not requeue in-flight commands by itself as soon as it blocks the device because the connection was lost. It seems the braindead obvious solution to me. Then, if the replacement_timeout expires, all commands (in-flight and queued) should be failed together to the layer above. I don't understand why they should get different treatment.



Secondly, I read in the docs that SCSI commands are retried 5 times.

OK, good! But then I don't understand why ext4 still sees data loss. I was doing cycles of

...
stop target service
wait 15 secs
start target service
wait 15 secs
...

(in the meantime the initiator is untarring tens of thousands of files from a kernel tarball in an endless loop)
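
For anyone who wants to reproduce this, the cycle above can be sketched roughly as follows. This is only a minimal sketch: the function name, its parameters, and the `service scst ...` commands in the comment are assumptions for illustration, not what I literally ran, and the real stop/start commands depend on how the target is managed.

```python
import subprocess
import time


def cycle_target(stop_cmd, start_cmd, wait_secs=15, cycles=1,
                 run=subprocess.check_call, sleep=time.sleep):
    """Stop the target, wait, start it again, wait; repeat `cycles` times.

    `run` and `sleep` are injectable so the sequence can be dry-run.
    """
    for _ in range(cycles):
        run(stop_cmd)      # stop target service
        sleep(wait_secs)   # wait 15 secs
        run(start_cmd)     # start target service
        sleep(wait_secs)   # wait 15 secs


# Example invocation on the target machine (hypothetical service name):
# cycle_target(["service", "scst", "stop"], ["service", "scst", "start"], cycles=10)
```

Passing a fake `run` and `sleep` lets the order of the steps be checked without touching a real target.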

In just 15 seconds I cannot believe the SCSI commands could really fail 5 times. That would imply a 3-second timeout, which is too low...

Also, when the SCSI layer resubmits the command (second submission), the device is blocked, so the command should get stuck in the queue and stay there until the connection is recovered (assuming a high enough replacement_timeout). So the commands should not fail more than once. Then why the errors?

I have even increased /sys/block/sdX/device/timeout to a very high value. That's the timeout for SCSI, isn't it?
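
A minimal sketch of reading and setting that value, assuming it lives at /sys/block/sdX/device/timeout as above and holds an integer number of seconds. The function names and the `sysfs_root` parameter are made up for illustration (the latter only so the code can be pointed at a scratch directory), and writing to the real file needs root.

```python
from pathlib import Path


def scsi_timeout_path(dev, sysfs_root="/sys"):
    """Return the sysfs path of the timeout file for a block device like 'sdd'."""
    return Path(sysfs_root) / "block" / dev / "device" / "timeout"


def read_scsi_timeout(dev, sysfs_root="/sys"):
    """Read the current timeout value (assumed to be whole seconds)."""
    return int(scsi_timeout_path(dev, sysfs_root).read_text().strip())


def write_scsi_timeout(dev, seconds, sysfs_root="/sys"):
    """Write a new timeout value; requires root on a real system."""
    scsi_timeout_path(dev, sysfs_root).write_text(f"{int(seconds)}\n")
```

For example, `write_scsi_timeout("sdd", 600)` followed by `read_scsi_timeout("sdd")` shows whether the new value was accepted.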



I have also increased the following open-iscsi timeouts:
    node.session.timeo.replacement_timeout = 480
    node.session.err_timeo.abort_timeout = 60
    node.session.err_timeo.lu_reset_timeout = 80
    node.session.err_timeo.host_reset_timeout = 240

but apparently nothing helps.
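
In case it matters how the values were applied: the sketch below builds the `iscsiadm` node-record update commands for the four settings listed above. The target IQN and portal in the example are placeholders, not my real ones, and the helper name is made up. As far as I understand, a changed node record only takes effect at the next login of the session, but that is my assumption.

```python
# Timeout settings from the list above (values in seconds).
SETTINGS = {
    "node.session.timeo.replacement_timeout": 480,
    "node.session.err_timeo.abort_timeout": 60,
    "node.session.err_timeo.lu_reset_timeout": 80,
    "node.session.err_timeo.host_reset_timeout": 240,
}


def iscsiadm_update_cmds(target, portal, settings=SETTINGS):
    """Return one 'iscsiadm -m node ... -o update' argv list per setting."""
    return [
        ["iscsiadm", "-m", "node", "-T", target, "-p", portal,
         "-o", "update", "-n", name, "-v", str(value)]
        for name, value in settings.items()
    ]


# Print the commands for a placeholder target and portal instead of running them.
for cmd in iscsiadm_update_cmds("iqn.2010-12.example:disk1", "192.0.2.10:3260"):
    print(" ".join(cmd))
```

Printing rather than executing keeps the sketch harmless; each list could be passed to `subprocess.check_call` on the initiator.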

Please help

Thank you
PS: SCST target
