Re: [ofa-general][PATCH 3/4] SRP fail-over faster

David Dillow Thu, 22 Oct 2009 17:34:14 -0700

On Thu, 2009-10-22 at 20:24 -0400, Vu Pham wrote:
> David Dillow wrote:
> > On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote:
> >> Yes and you can not disable intirely. I'm still looking at 
> >> benefits/advantages to disable it entirely
> >
> > To me, the advantage is I have a perfectly viable backup path to the
> > storage, and can immediately start issuing commands to it rather than
> > waiting for any timeout. On my systems, 1 second can be up to 1500 MB
> > transferred and a _huge_ number of compute cycles. And I expect those
> > numbers to grow.
> >
> You can still do so with these patches applied by using the right device 
> name (ie. /dev/sdXXX)


Not in a multipath situation configured for failover. I have to use the
multipath device, which will then use the appropriate path as
prioritized by ALUA.

> >> I use the user supplied setting for local async event on port error 
> >> where link is broken from host to switch
> >
> > Perhaps that part should be in the patch that adds that support, then?
> >   
> That's patch #4

Sure, and perhaps the part that massages the timeout should be in the
patch that introduces it and actually uses it, no?

> > This makes a certain amount of sense; I was confused by the two
> > unrelated changes in this patch. I'm still not all that happy about a
> > hard-coded 5 seconds, especially with no explanation about the magic
> > number.
> >   
> As I said above, it's not magic at all, it just that certain unknown 
> seconds already passed by, therefore, just pick X seconds to sleep on.

Sorry, this is a common idiom here -- a bare number in source code, with
no explanation as to where it came from or why it was picked, is often
called a "magic number."

I'm saying you should comment on it, either in the commit message or in
a comment in the code. Or better yet, give it a #define and a comment
above that definition that says why you picked it.

In other words, don't make someone who comes along after us have to
search for this mail thread to figure out that the 5 second sleep was
pulled out of thin air.

> >> To really sleep user supplied number of seconds, we need to register 
> >> trap to SM and receiving trap for a node leaving the fabric.
> >> It requires a lot of changes in srp_daemon (registering to trap, passing 
> >> event down to srp driver) and srp driver (handling this event)
> >>     
> >
> > Well, if this were done, then you wouldn't need to sleep at all would
> > you? Just wait for the trap telling you the target rejoined the fabric?
> > Perhaps you'd want a delay before tearing down the target connection,
> > but then that could be part of the user settings above?
> >
> > Not that I'm sure it is worth it, though.
> >   
> If it's done, you still need to sleep target->device_loss_timeout 
> (instead of some unknown seconds + 5) to tear down connection so that 
> dm-multipath can fail-over.

Or I can just start failing requests due to knowing they won't get to
the target so dm-multipath will use the backup path immediately. I can
sleep as long as I want before killing the connection, just in case it
comes back, but my commands will still be going to the other path.

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [ofa-general][PATCH 3/4] SRP fail-over faster

Reply via email to