Sam,
Just a random thought..
Just like our distribution code is pluggable, perhaps the
multi-pathing code can also be (eventually?) moved into a plugin-based
architecture..?
Also I think, instead of using the secondary paths only for failover
scenarios, multi-pathing plugins could define to choose low latency
networks for MD traffic and high/BW networks for data or something to
that effect. Instead of hard-coding policies such as these,
we could probably provide the framework and possibly a default plugin
(failover and failback etc) and leave the possibility of 3rd
party/external researchers to write their own plugin..

I havent reviewed your patches yet. I plan to do that tomorrow to get
a better feel for the interfaces and how we can separate out the
policies from the framework mechanisms..So right now, what I have said
above is complete vapor-ware and may make no sense :)
I guess I was thinking of multi-pathing plugins in the SAN world
provided by storage array vendors (e.g..PowerPath for EMC)
and was intrigued by the possibility of providing multiple paths
across different networks/interconnects ..
Thanks,
Murali


On Nov 7, 2007 2:21 PM, Sam Lang <[EMAIL PROTECTED]> wrote:
>
> I discussed the desired behavior we want out of this fail-over code
> with folks offline, and we came up with a plan.
>
> At the moment, there are two conflicting failure schemes, multiple
> addresses, and multiple protocols.  Also, the multiple protocols case
> isn't ideal for failure, since both protocols on the same host are
> active and listening, making a timeout with one protocol
> automatically switch to the second (often slower) one, without any
> admin fail-over procedures.
>
> The plan we have is to allow multiple addresses to be specified in an
> ordered list as before:
>
> Alias hosta mx://hosta:0:3 mx://hostb:0:4 ... tcp://hosta:3334 tcp://
> hostb:3335 ....
>
> For the server, a command line option specifies which addresses from
> this list to listen on.
>
> pvfs2-server fs.conf -a hosta -e "mx://hosta:0:3" -e "tcp://hosta:3334"
>
> For the client, the contact order is based on the addresses in the
> Alias from left to right, with a few caveats:
>
> 1.  Once an address fails, the client attempts the next addresses in
> the list till one succeeds in a round-robin fashion.  Once an attempt
> succeeds, the cursor is reset to the beginning of the list.  This
> allows fail-over to a server to be 'reset' at some point, without
> getting stuck attempting to contact (and succeeding) on a different
> protocol.
>
> 2.  An option to the mount entry specifies which protocols should be
> "filtered out" of the list.  This allows clients to control the
> secondary endpoints that are attempted from the list.  The order
> remains the same, but for example, nodes that only want to use mx can
> set a mount option to bmi=mx, and the list of endpoints in the above
> example just becomes:
>
> Alias hosta mx://hosta:0:3 mx://hostb:0:4 ...
>
> This prevents the behavior where a client would connect over tcp
> instead of attempting mx addresses until the HA infrastructure has
> time to do the proper fail-over.  In some cases admins may choose to
> interleave protocol addresses:
>
> Alias hosta mx://hosta:0:3 tcp://hosta:3334 mx://hostb:0:4 tcp://
> hostb:3335 ....
>
> The default behavior here would be to fail-over to the tcp address if
> the mx address failed, which the admin may actually want.  With the
> mount option, the user/admin can then further constrain the list to
> be only the mx addresses (or tcp), and allow the fail-over to only
> occur on one protocol.  Also, the mount option can specify an
> ordering to the protocols as well, so that even if the different
> addresses of the protocols are interleaved in the list, the mx
> addresses will always be attempted first, then the tcp addresses.
>
> Hopefully this covers all the scenarios we plan to see.  At the
> moment I'm considering the mount option as bmi=<proto1>:<proto2>:...
>
> I'm going to start modifying the patch to get this behavior.  Let me
> know what you think.
>
> Thanks,
> -sam
>
>
>
> On Nov 6, 2007, at 6:18 PM, Sam Lang wrote:
>
> >
> > Here's take2.  Hopefully a little cleaner.
> >
> > The more I think about the issue with multiple protocols and
> > primary/secondary addresses, the more complicated it gets.  I've
> > added an option to the server to specify the index in the set of
> > primary/secondary addresses to listen on, allowing a server to be
> > started and listen on the 3rd set of addresses in the endpoint
> > string.  This doesn't fix the problem on the client though, as we
> > don't really want to iterate over both the primary/secondary
> > endpoints and the different protocols.  I guess maybe the protocol
> > on the client has to be chosen based on the protocol specified in
> > the mntent.
> >
> > -sam
> >
> > <bmi-maddrs-take2.patch>
>
> >
> > On Nov 6, 2007, at 2:27 PM, Sam Lang wrote:
> >
> >>
> >> On Nov 6, 2007, at 1:41 PM, Pete Wyckoff wrote:
> >>
> >>> [EMAIL PROTECTED] wrote on Tue, 06 Nov 2007 10:50 -0600:
> >>>> The attached patch implements BMI multiple address endpoints
> >>>> that we
> >>>> talked about some time ago.  To refresh everyone's memory, this
> >>>> allows a set of addresses to be specified:
> >>>>
> >>>> tcp://host1:3334/pvfs2-fs,tcp://host2:3335/pvfs2-fs,tcp:://
> >>>> host3:3336/
> >>>> pvfs2-fs
> >>>>
> >>>> In the config file for a given storage endpoint.  The BMI code
> >>>> manages the endpoint, setting the currently used address to the
> >>>> first
> >>>> one in the list.  On message failure, the endpoint is
> >>>> transitioned to
> >>>> point to the next address in the list.  This continues in a round-
> >>>> robin fashion.
> >>>
> >>> This is good stuff.  I'd like to help review it.  Can you do some
> >>> trivial things first to make that easier?
> >>>
> >>> 1.
> >>>> +    struct bmi_endpoint_ref_s *newref;
> >>> [..]
> >>>> +    /* haven't seen any of these addresses before, add a new
> >>>> endpoint */
> >>>> +    newref = malloc(sizeof(struct bmi_addr_ref_s));
> >>>
> >>> Please change all malloc(sizeof()) and memset(,,sizeof()) to use the
> >>> variable name, not its type.  The bug you did above (and other
> >>> places) is way too common.  We just have to stop doing that.  Like
> >>> this instead:
> >>>
> >>>     newref = malloc(sizeof(*newref));
> >>>
> >>> 2.
> >>> There's a bunch of stuff that seems out of place.  Can you check
> >>> that in or push it to the side so we can concentrate on the core?
> >>> 179 kB is a big patch.  :)
> >>
> >> The bmi-addr.[ch] files are new files, and the critical part of
> >> the patch.  They obsolete the ref_st calls, so the reference-list.
> >> [ch] files have been removed in the patch.  That may be adding to
> >> the number of lines.
> >>
> >>>
> >>> Renaming all the method_ops is a big uninteresting part.
> >>
> >> Yeah, I got tired of calling them over and over with the
> >> BMI_method_ tagged on the front.  I can try to pull that stuff out.
> >>
> >>>
> >>> Adding bmi_ in front of everything too, but it's too hard to rip
> >>> that out mid-patch now.  Random whitespace fixes too.  Good changes,
> >>> just hard to read.
> >>>
> >>> 3.
> >>> Some trivial bugs.
> >>>
> >>>> +    ref->current = ref->current + 1 % ref->count;
> >>>
> >>> Check your precedence table.
> >>>
> >>>> + * (C) 2001 Clemson University and The University of Chicago
> >>>
> >>> New files go back in time.
> >>>
> >>> Can you put some comments above each of the three new structs in
> >>> bmi-addr.c?  I keep getting confused on which is the old-style
> >>> addr and which is the comma-separated list.  And what the various
> >>> "link" and "refs" fields point to.
> >>
> >> Sure thing.
> >>
> >>>
> >>>> I've done some basic testing, but there's still more to do.  The
> >>>> client IO state machine is a bear, and testing all the cases where
> >>>> things could failover (requests, flows, acks, etc.) is going to
> >>>> take
> >>>> some more work.  I wanted to get the patch out there to allow
> >>>> others
> >>>> to provide feedback.
> >>>
> >>> Yeah, totally.  But it can be made to work.
> >>>
> >>> What did we decide do with mixed method usage?  The old semantic
> >>> was "ib://foo:2345/pvfs2-fs,tcp://foo:2347/pvfs2-fs" means try to
> >>> use IB, but if you don't have an IB nic, switch to TCP.  I agree we
> >>> decided that was less interesting.  Do we just add docs that say
> >>> that this comma is now for multi-pathing?  If people try this
> >>> example with the new code, it will flip from IB to TCP at every
> >>> timeout.  The old behavior was to stick with the first one where
> >>> you had the hardware.  In other words, probably some docs somewhere
> >>> should be added to this patch.
> >>
> >> Right, I thought we had decided to go the multi-path route.  I
> >> guess there could be a config option that would set the flipping
> >> from round-robin to try-once, giving the old behavior.
> >>
> >>>
> >>> On the server side, the ,-separated addresses mean "listen on all
> >>> these interfaces".  What do servers do now when they see your
> >>> tcp://host1,tcp://host2 example string above?  Looks like they would
> >>> fail to listen on anything (host1 can't bind to host2 address?).
> >>> This has to be in the fs.conf as the Alias string for each server so
> >>> that clients can find the IO servers, not just in the pvfs2tab to
> >>> find the config server.
> >>
> >> Yeah that formatting isn't going to work.  I wanted to keep it
> >> simple, but I guess that's not possible.  Should we separate
> >> fallback addresses with ';' instead?
> >>
> >> tcp://host1:3331,ib://host1:3335;tcp://host2:3332,ib://host2:3336
> >>
> >> Something like that?
> >> -sam
> >>
> >>>
> >>>             -- Pete
> >>>
> >>
> >
>
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Reply via email to