On Thu, Mar 15, 2018 at 12:48 AM, Mike Christie <mchri...@redhat.com> wrote:

> ...
>
> It looks like there is a bug.
>
> 1. A regression was added when I stopped killing the iscsi connection
> when the lock is taken away from us to handle a failback bug where it
> was causing ping ponging. That combined with #2 will cause the bug.
>
> 2. I did not anticipate the type of sleeps above where they are injected
> any old place in the kernel. For example, if a command had really got
> stuck on the network then the nop timer would fire which forces the
> iscsi thread's recv() to fail and that submitting thread to exit. Or we
> should handle the delay-request-in-tcmu-runner.diff issue ok, because we
> wait for those commands. However, we could just get rescheduled due to
> hitting a preemption point and we might not be rescheduled for longer
> than failover timeout seconds. For this it could just be some buggy code
> that gets run on all the cpus for more than failover timeout seconds
> then recovers, and we would hit the bug in your patch above.
>
> The 2 attached patches fix the issues for me on linux. Note that it only
> works on linux right now and it only works with 2 nodes. It probably
> also works for ESX/windows, but I need to reconfig some timers.
>
> Apply ceph-iscsi-config-explicit-standby.patch to ceph-iscsi-config and
> tcmu-runner-use-explicit.patch to tcmu-runner.
>

Mike, thank you for the patches; they seem to work. There is one issue,
though it is not related to data corruption: if the second path (gateway)
is unavailable and I restart tcmu-runner on the first gateway, all
subsequent I/O hangs for a long time, because tcmu-runner is in the
UNLOCKED state and the initiator doesn't resend the explicit ALUA
activation request for a long while (190s).

Can you please also clarify how explicit ALUA (with these patches applied)
is immune to the situation where stale requests are still sitting in
kernel queues at the moment tcmu-runner handles tcmu_explicit_transition()
--> tcmu_acquire_dev_lock()? Does it mean that all requests are strictly
ordered and the initiator will never send us read/write requests until we
complete that explicit ALUA activation request?
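
To make the question concrete, here is a toy Python model of the scenario
I am worried about. It is only a sketch under my own assumptions: the
Gateway class, its methods and the dict standing in for the backing image
are all hypothetical and do not correspond to tcmu-runner or kernel code.
It only shows how a request that stalled in a queue could overwrite newer
data once the lock comes back.

```python
from collections import deque


class Gateway:
    """Toy stand-in for one iSCSI gateway (hypothetical, not tcmu-runner)."""

    def __init__(self, name, disk):
        self.name = name
        self.disk = disk        # shared backing image, modelled as a dict
        self.has_lock = False
        self.queue = deque()    # requests received but not yet executed

    def receive(self, lba, data):
        # The request is accepted and then sits in a queue, e.g. because
        # the submitting thread is not scheduled for a long time.
        self.queue.append((lba, data))

    def acquire_lock(self, other):
        # Models an explicit ALUA activation: take the lock from the peer.
        other.has_lock = False
        self.has_lock = True

    def drain(self):
        # Execute whatever is queued, as long as we hold the lock now.
        while self.queue and self.has_lock:
            lba, data = self.queue.popleft()
            self.disk[lba] = data


def stale_write_scenario():
    disk = {}
    a, b = Gateway("A", disk), Gateway("B", disk)

    a.acquire_lock(b)
    a.receive(0, "old")     # stalls in A's queue before reaching the disk

    b.acquire_lock(a)       # initiator fails over to B
    b.receive(0, "new")
    b.drain()               # the image now holds the newer data

    a.acquire_lock(b)       # initiator fails back to A
    a.drain()               # the stale request finally runs
    return disk[0]
```

In this model stale_write_scenario() ends with the old data on disk, which
is the kind of outcome I would like to be sure cannot happen.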

Thanks,
Maxim
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
