On 03/11/11 17:58, Mike Christie wrote:
>
> Do you mean iscsi_tcp reconnects before the replacement_timeout? If so
> iser should be doing this too. It is a bug if it is not coming back
> until after replacement_timeout seconds if the problem has been fixed
> within that many seconds.
>   

Thanks for the explanation, I'll try with more debugging output.

Actually, everything is fine as long as the connection loss is fixed
within the replacement_timeout (120 s in our case). But if the outage
lasts longer, lots of problems occur with the file system inside the VM.
We can't control parameters inside the customer VMs, so we can't tune
the file-system settings there.
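For reference, this is how we adjust the timeout on the initiator side; the
target IQN and portal below are placeholders for our setup, not the real
values:

```shell
# Raise replacement_timeout (in seconds) in the node record, so that
# recovery blocks I/O instead of failing it quickly. Target name and
# portal are placeholders.
iscsiadm -m node -T iqn.2011-03.example.com:storage -p 192.168.1.10 \
    -o update -n node.session.timeo.replacement_timeout -v 86400

# Check the value actually in effect for the logged-in sessions:
cat /sys/class/iscsi_session/session*/recovery_tmo
```

The node record only affects sessions created after the update; for an
existing session the sysfs value shows what is currently in force.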

What we want is for all I/O to block, without data corruption inside the
VMs, until we replace the failed component (e.g. an IB switch). With
Ethernet (iscsi_tcp) we get exactly that behaviour.
But with RDMA the VM loses a lot of data. If we increase the
replacement_timeout, the guest kernel notices after its 120 s blocked-I/O
timeout, discards the I/Os, and one of the three cases mentioned above
occurs.

I guess that some piece of software has to resend the pending I/O when
the connection becomes available again after an I/O error, so that the
guest kernel never notices. I think that QEMU/KVM can do something like
that.
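What I had in mind is QEMU's error policy for drives: with werror/rerror set
to "stop", the guest is paused on an I/O error instead of seeing it, and can
be resumed once storage is back. A sketch of the invocation (the image path
and device options are placeholders, not our actual configuration):

```shell
# Pause the guest on write/read errors instead of propagating them;
# /dev/sdX stands in for the real iSCSI-backed block device.
qemu-kvm -m 1024 \
    -drive file=/dev/sdX,if=virtio,cache=none,werror=stop,rerror=stop

# After the failed component has been replaced and the session has
# recovered, resume the guest from the QEMU monitor:
#   (qemu) cont
```

Whether the pending requests are then retried transparently, or whether the
error still reaches the guest in the iser case, is exactly what I'd like to
understand.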

Is there a difference in the I/O errors reported to the application
depending on the transport (iser or tcp)?

Btw: IPoIB with iscsi_tcp is even worse. There, the VM can't be shut
down any more (I/O error), and the QEMU/KVM process can't be killed
immediately, even with SIGKILL. The defunct process remains in the
system for a long time.

So, is this some kind of QEMU/KVM or open-iscsi issue?

Best regards,

Sebastian

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.
