Re: [ofa-general][PATCH 3/4] SRP fail-over faster

Vu Pham Fri, 23 Oct 2009 09:51:02 -0700

David Dillow wrote:

On Thu, 2009-10-22 at 20:24 -0400, Vu Pham wrote:

David Dillow wrote:
On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote:
Yes, and you cannot disable it entirely. I'm still looking at the benefits/advantages of disabling it entirely.
To me, the advantage is I have a perfectly viable backup path to the
storage, and can immediately start issuing commands to it rather than
waiting for any timeout. On my systems, 1 second can be up to 1500 MB
transferred and a _huge_ number of compute cycles. And I expect those
numbers to grow.
You can still do so with these patches applied by using the right device name (i.e. /dev/sdXXX).


Not in a multipath situation configured for failover. I have to use the
multipath device, which will then use the appropriate path as
prioritized by ALUA.

I don't know much about multipath in ALUA mode.

How would the multipath driver (in ALUA mode) switch paths (i.e. based on what criteria)? Can you switch paths manually from user mode (while there are commands stuck in the current active path)?

Without this patch, all outstanding I/Os have to go through error recovery before being returned with an error code so that dm-multipath can fail over.

I use the user-supplied setting for a local async event on port error, where the link is broken from host to switch.
Perhaps that part should be in the patch that adds that support, then?
That's patch #4
Sure, and perhaps the part that massages the timeout should be in the
patch that introduces it and actually uses it, no?


I will look at it and rework the patch.

This makes a certain amount of sense; I was confused by the two
unrelated changes in this patch. I'm still not all that happy about a
hard-coded 5 seconds, especially with no explanation about the magic
number.
As I said above, it's not magic at all; it's just that an unknown number of seconds has already passed, so I just pick X seconds to sleep.


Sorry, this is a common idiom here -- a bare number in source code, with
no explanation as to where it came from or why it was picked, is often
called a "magic number."

I'm saying you should comment on it, either in the commit message or in
a comment in the code. Or better yet, give it a #define and a comment
above that definition that says why you picked it.

In other words, don't make someone who comes along after us have to
search for this mail thread to figure out that the 5 second sleep was
pulled out of thin air.
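
A minimal sketch of the "#define plus comment" suggestion above, for illustration only. The macro and helper names are hypothetical and do not come from the patch; the comment text just restates the rationale given in this thread.

```c
/*
 * Hypothetical name, not taken from the actual patch.
 *
 * By the time the driver reaches this point, an unknown number of seconds
 * has already elapsed since the failure, so the remaining wait cannot be
 * derived from the user-supplied timeout. 5 seconds is an arbitrary pick,
 * as described in this thread.
 */
#define SRP_EXTRA_RECOVERY_DELAY_SEC 5

/* Convert the documented delay to milliseconds, e.g. for a sleep call. */
static unsigned int srp_extra_recovery_delay_ms(void)
{
	return SRP_EXTRA_RECOVERY_DELAY_SEC * 1000u;
}
```

With the value named and commented in one place, a later reader can find the reasoning next to the code instead of in the list archive.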

Understood.

To really sleep for the user-supplied number of seconds, we need to register for traps with the SM and receive the trap for a node leaving the fabric. That requires a lot of changes in srp_daemon (registering for the trap, passing the event down to the srp driver) and in the srp driver (handling this event).
Well, if this were done, then you wouldn't need to sleep at all would
you? Just wait for the trap telling you the target rejoined the fabric?
Perhaps you'd want a delay before tearing down the target connection,
but then that could be part of the user settings above?

Not that I'm sure it is worth it, though.
If it's done, you still need to sleep target->device_loss_timeout (instead of some unknown number of seconds + 5) before tearing down the connection so that dm-multipath can fail over.
Or I can just start failing requests due to knowing they won't get to
the target so dm-multipath will use the backup path immediately. I can
sleep as long as I want before killing the connection, just in case it
comes back, but my commands will still be going to the other path.

If you want to fail requests right away, you can just set device_loss_timeout=1; others don't want dm-multipath to switch paths right away. That's the whole idea of the patches that I submitted.
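
The trade-off described above can be sketched as a simple timeout check. This is a simplified model, not the SRP driver's code: the struct, its fields other than the device_loss_timeout name, and the function are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of one path; not a real kernel structure. */
struct path_state {
	unsigned int device_loss_timeout; /* seconds, user supplied */
	unsigned int secs_since_loss;     /* seconds since the path went down */
};

/*
 * Return true once the path has been down for at least
 * device_loss_timeout seconds, i.e. when outstanding requests would be
 * failed so that dm-multipath can switch to another path.
 */
static bool should_fail_requests(const struct path_state *p)
{
	return p->secs_since_loss >= p->device_loss_timeout;
}
```

In this model, a timeout of 1 fails requests almost immediately, while a larger value gives the original path time to come back before dm-multipath switches.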

