Re: [Pvfs2-developers] bmi multiple address endpoints

Sam Lang Wed, 07 Nov 2007 14:21:44 -0800

I discussed the desired behavior we want out of this fail-over code with folks offline, and we came up with a plan.

At the moment, there are two conflicting failure schemes: multiple addresses and multiple protocols. Also, the multiple protocols case isn't ideal for failure, since both protocols on the same host are active and listening, making a timeout with one protocol automatically switch to the second (often slower) one, without any admin fail-over procedures.

The plan we have is to allow multiple addresses to be specified in an ordered list as before:

Alias hosta mx://hosta:0:3 mx://hostb:0:4 ... tcp://hosta:3334 tcp://hostb:3335 ....

For the server, a command line option specifies which addresses from this list to listen on.


pvfs2-server fs.conf -a hosta -e "mx://hosta:0:3" -e "tcp://hosta:3334"

For the client, the contact order is based on the addresses in the Alias from left to right, with a few caveats:

1. Once an address fails, the client attempts the next addresses in the list till one succeeds in a round-robin fashion. Once an attempt succeeds, the cursor is reset to the beginning of the list. This allows fail-over to a server to be 'reset' at some point, without getting stuck attempting to contact (and succeeding) on a different protocol.

2. An option to the mount entry specifies which protocols should be "filtered out" of the list. This allows clients to control the secondary endpoints that are attempted from the list. The order remains the same, but for example, nodes that only want to use mx can set a mount option to bmi=mx, and the list of endpoints in the above example just becomes:


Alias hosta mx://hosta:0:3 mx://hostb:0:4 ...

This prevents the behavior where a client would connect over tcp instead of attempting mx addresses until the HA infrastructure has time to do the proper fail-over. In some cases admins may choose to interleave protocol addresses:

Alias hosta mx://hosta:0:3 tcp://hosta:3334 mx://hostb:0:4 tcp://hostb:3335 ....

The default behavior here would be to fail-over to the tcp address if the mx address failed, which the admin may actually want. With the mount option, the user/admin can then further constrain the list to be only the mx addresses (or tcp), and allow the fail-over to only occur on one protocol. Also, the mount option can specify an ordering to the protocols as well, so that even if the different addresses of the protocols are interleaved in the list, the mx addresses will always be attempted first, then the tcp addresses.
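
As a rough sketch of the default contact order from caveat 1 (advance on failure, reset the cursor on success), something like the following could work. Everything here is hypothetical and not taken from the patch: the `addr_list` struct, the `contact` function, the `try_addr_fn` callback, and the `demo_try` stand-in are names made up for illustration, and the attempt limit is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical client-side view of one Alias line (not from the patch). */
struct addr_list {
    const char **addrs;  /* addresses in Alias order, left to right */
    size_t count;        /* number of addresses, must be > 0 */
    size_t cursor;       /* index of the next address to try */
};

/* Hypothetical callback that attempts one address; returns 0 on success. */
typedef int (*try_addr_fn)(const char *addr, const void *arg);

/*
 * Try addresses starting at the cursor, moving round-robin on each failure.
 * On success, reset the cursor to the start of the list and return the
 * index that worked. Returns -1 if max_attempts tries all fail; the cursor
 * then stays where it stopped so a later call continues from there.
 */
int contact(struct addr_list *list, try_addr_fn try_addr,
            const void *arg, size_t max_attempts)
{
    assert(list->count > 0);
    for (size_t i = 0; i < max_attempts; i++) {
        size_t idx = list->cursor;
        if (try_addr(list->addrs[idx], arg) == 0) {
            list->cursor = 0;  /* fail-over is 'reset' after a success */
            return (int)idx;
        }
        list->cursor = (list->cursor + 1) % list->count;
    }
    return -1;
}

/* Stand-in for a real send: any address containing `down` fails. */
int demo_try(const char *addr, const void *down)
{
    return strstr(addr, (const char *)down) != NULL ? -1 : 0;
}
```

With the interleaved list above and hosta unreachable, the mx and tcp addresses for hosta are tried and fail, the mx address for hostb succeeds, and the next request starts again from the first entry.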

Hopefully this covers all the scenarios we plan to see. At the moment I'm considering the mount option as bmi=<proto1>:<proto2>:...
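
To make the filtering and ordering concrete, here is a minimal sketch of how a client might apply the value of a bmi=<proto1>:<proto2> option to the Alias list. It is only an illustration under assumptions: `filter_addrs` and `scheme_len` are hypothetical names, the option value is assumed to arrive with the leading "bmi=" already stripped, and the output array is assumed to have room for `count` entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the protocol prefix in "mx://...", or 0 if there is none. */
static size_t scheme_len(const char *addr)
{
    const char *sep = strstr(addr, "://");
    return sep ? (size_t)(sep - addr) : 0;
}

/*
 * Hypothetical helper: copy into out[] the addresses whose protocol is
 * named in protos (e.g. "mx" or "mx:tcp"). Protocols are emitted in the
 * order given in protos; within one protocol the Alias order is kept.
 * Addresses of unlisted protocols are dropped. Returns the number of
 * entries written, never more than count.
 */
size_t filter_addrs(const char *protos, const char **addrs,
                    size_t count, const char **out)
{
    size_t n = 0;
    const char *p = protos;

    assert(protos && addrs && out);
    while (*p) {
        size_t plen = strcspn(p, ":");  /* length of this protocol name */
        for (size_t i = 0; plen > 0 && i < count && n < count; i++) {
            if (scheme_len(addrs[i]) == plen &&
                strncmp(addrs[i], p, plen) == 0)
                out[n++] = addrs[i];
        }
        p += plen;
        if (*p == ':')
            p++;  /* skip the separator */
    }
    return n;
}
```

Applied to the interleaved example, "mx" leaves only the two mx addresses, while "mx:tcp" yields both mx addresses followed by both tcp addresses.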

I'm going to start modifying the patch to get this behavior. Let me know what you think.


Thanks,
-sam


On Nov 6, 2007, at 6:18 PM, Sam Lang wrote:

Here's take2.  Hopefully a little cleaner.
The more I think about the issue with multiple protocols and primary/secondary addresses, the more complicated it gets. I've added an option to the server to specify the index in the set of primary/secondary addresses to listen on, allowing a server to be started and listen on the 3rd set of addresses in the endpoint string. This doesn't fix the problem on the client though, as we don't really want to iterate over both the primary/secondary endpoints and the different protocols. I guess maybe the protocol on the client has to be chosen based on the protocol specified in the mntent.
-sam

<bmi-maddrs-take2.patch>

On Nov 6, 2007, at 2:27 PM, Sam Lang wrote:
On Nov 6, 2007, at 1:41 PM, Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Tue, 06 Nov 2007 10:50 -0600:
The attached patch implements BMI multiple address endpoints that we
talked about some time ago.  To refresh everyone's memory, this
allows a set of addresses to be specified:
tcp://host1:3334/pvfs2-fs,tcp://host2:3335/pvfs2-fs,tcp://host3:3336/pvfs2-fs

In the config file for a given storage endpoint.  The BMI code
manages the endpoint, setting the currently used address to the first
one in the list.  On message failure, the endpoint is transitioned to
point to the next address in the list.  This continues in a round-robin
fashion.
This is good stuff.  I'd like to help review it.  Can you do some
trivial things first to make that easier?

1.
+    struct bmi_endpoint_ref_s *newref;
[..]
+ /* haven't seen any of these addresses before, add a new endpoint */
+    newref = malloc(sizeof(struct bmi_addr_ref_s));
Please change all malloc(sizeof()) and memset(,,sizeof()) to use the
variable name, not its type.  The bug you did above (and other
places) is way too common.  We just have to stop doing that.  Like
this instead:

        newref = malloc(sizeof(*newref));
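
A small sketch of why the `sizeof(*ptr)` form is safer: the size always follows the pointer's own type, so it cannot drift to a different (possibly smaller) struct. The two struct layouts and the `endpoint_new` function below are made-up stand-ins, not the definitions from the patch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; the real structs in the patch will differ. */
struct addr_ref_demo {
    int id;
};

struct endpoint_ref_demo {
    int id;
    char name[64];
    struct addr_ref_demo *addrs;
};

/* Allocate and zero a new endpoint reference. */
struct endpoint_ref_demo *endpoint_new(void)
{
    /* sizeof(*newref) is the size of struct endpoint_ref_demo, whatever
     * newref is declared as. Writing sizeof(struct addr_ref_demo) here
     * would compile cleanly but allocate too little memory. */
    struct endpoint_ref_demo *newref = malloc(sizeof(*newref));

    if (newref != NULL)
        memset(newref, 0, sizeof(*newref));
    return newref;
}
```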

2.
There's a bunch of stuff that seems out of place.  Can you check
that in or push it to the side so we can concentrate on the core?
179 kB is a big patch.  :)
The bmi-addr.[ch] files are new files, and the critical part of the patch. They obsolete the ref_st calls, so the reference-list.[ch] files have been removed in the patch. That may be adding to the number of lines.
Renaming all the method_ops is a big uninteresting part.
Yeah, I got tired of calling them over and over with the BMI_method_ tagged on the front. I can try to pull that stuff out.
Adding bmi_ in front of everything too, but it's too hard to rip
that out mid-patch now.  Random whitespace fixes too.  Good changes,
just hard to read.

3.
Some trivial bugs.
+    ref->current = ref->current + 1 % ref->count;
Check your precedence table.
+ * (C) 2001 Clemson University and The University of Chicago
New files go back in time.
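
To spell out the precedence bug quoted above: in C, `%` binds tighter than `+`, so the quoted line is evaluated as `ref->current + (1 % ref->count)` and never wraps around. The sketch below uses a made-up `struct ref_demo` with assumed field types; only the field names come from the quoted line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the endpoint struct; field types are assumed. */
struct ref_demo {
    size_t current;  /* index of the address in use */
    size_t count;    /* number of addresses in the list */
};

/* As quoted from the patch: parsed as current + (1 % count). */
size_t advance_buggy(const struct ref_demo *ref)
{
    return ref->current + 1 % ref->count;
}

/* Intended round-robin step: parenthesize the addition. */
size_t advance_fixed(const struct ref_demo *ref)
{
    assert(ref->count > 0);
    return (ref->current + 1) % ref->count;
}
```

With three addresses and `current` at the last index (2), the buggy form yields 3, which is past the end of the list, while the fixed form wraps to 0.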

Can you put some comments above each of the three new structs in
bmi-addr.c?  I keep getting confused on which is the old-style
addr and which is the comma-separated list.  And what the various
"link" and "refs" fields point to.
Sure thing.
I've done some basic testing, but there's still more to do.  The
client IO state machine is a bear, and testing all the cases where
things could failover (requests, flows, acks, etc.) is going to take
some more work.  I wanted to get the patch out there to allow others
to provide feedback.
Yeah, totally.  But it can be made to work.

What did we decide to do with mixed method usage?  The old semantic
was "ib://foo:2345/pvfs2-fs,tcp://foo:2347/pvfs2-fs" means try to
use IB, but if you don't have an IB nic, switch to TCP.  I agree we
decided that was less interesting.  Do we just add docs that say
that this comma is now for multi-pathing?  If people try this
example with the new code, it will flip from IB to TCP at every
timeout.  The old behavior was to stick with the first one where
you had the hardware.  In other words, probably some docs somewhere
should be added to this patch.
Right, I thought we had decided to go the multi-path route. I guess there could be a config option that would set the flipping from round-robin to try-once, giving the old behavior.
On the server side, the ,-separated addresses mean "listen on all
these interfaces".  What do servers do now when they see your
tcp://host1,tcp://host2 example string above?  Looks like they would
fail to listen on anything (host1 can't bind to host2 address?).
This has to be in the fs.conf as the Alias string for each server so
that clients can find the IO servers, not just in the pvfs2tab to
find the config server.
Yeah that formatting isn't going to work. I wanted to keep it simple, but I guess that's not possible. Should we separate fallback addresses with ';' instead?
tcp://host1:3331,ib://host1:3335;tcp://host2:3332,ib://host2:3336

Something like that?
-sam
                -- Pete


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
