I discussed the desired behavior we want out of this fail-over code with folks offline, and we came up with a plan.

At the moment, there are two conflicting failure schemes, multiple addresses, and multiple protocols. Also, the multiple protocols case isn't ideal for failure, since both protocols on the same host are active and listening, making a timeout with one protocol automatically switch to the second (often slower) one, without any admin fail-over procedures.

The plan we have is to allow multiple addresses to be specified in an ordered list as before:

Alias hosta mx://hosta:0:3 mx://hostb:0:4 ... tcp://hosta:3334 tcp:// hostb:3335 ....

For the server, a command line option specifies which addresses from this list to listen on.

pvfs2-server fs.conf -a hosta -e "mx://hosta:0:3" -e "tcp://hosta:3334"

For the client, the contact order is based on the addresses in the Alias from left to right, with a few caveats:

1. Once an address fails, the client attempts the next addresses in the list till one succeeds in a round-robin fashion. Once an attempt succeeds, the cursor is reset to the beginning of the list. This allows fail-over to a server to be 'reset' at some point, without getting stuck attempting to contact (and succeeding) on a different protocol.

2. An option to the mount entry specifies which protocols should be "filtered out" of the list. This allows clients to control the secondary endpoints that are attempted from the list. The order remains the same, but for example, nodes that only want to use mx can set a mount option to bmi=mx, and the list of endpoints in the above example just becomes:

Alias hosta mx://hosta:0:3 mx://hostb:0:4 ...

This prevents the behavior where a client would connect over tcp instead of attempting mx addresses until the HA infrastructure has time to do the proper fail-over. In some cases admins may choose to interleave protocol addresses:

Alias hosta mx://hosta:0:3 tcp://hosta:3334 mx://hostb:0:4 tcp:// hostb:3335 ....

The default behavior here would be to fail-over to the tcp address if the mx address failed, which the admin may actually want. With the mount option, the user/admin can then further constrain the list to be only the mx addresses (or tcp), and allow the fail-over to only occur on one protocol. Also, the mount option can specify an ordering to the protocols as well, so that even if the different addresses of the protocols are interleaved in the list, the mx addresses will always be attempted first, then the tcp addresses.

Hopefully this covers all the scenarios we plan to see. At the moment I'm considering the mount option as bmi=<proto1>:<proto2>:...

I'm going to start modifying the patch to get this behavior. Let me know what you think.

Thanks,
-sam


On Nov 6, 2007, at 6:18 PM, Sam Lang wrote:


Here's take2.  Hopefully a little cleaner.

The more I think about the issue with multiple protocols and primary/secondary addresses, the more complicated it gets. I've added an option to the server to specify the index in the set of primary/secondary addresses to listen on, allowing a server to be started and listen on the 3rd set of addresses in the endpoint string. This doesn't fix the problem on the client though, as we don't really want to iterate over both the primary/secondary endpoints and the different protocols. I guess maybe the protocol on the client has to be chosen based on the protocol specified in the mntent.

-sam

<bmi-maddrs-take2.patch>

On Nov 6, 2007, at 2:27 PM, Sam Lang wrote:


On Nov 6, 2007, at 1:41 PM, Pete Wyckoff wrote:

[EMAIL PROTECTED] wrote on Tue, 06 Nov 2007 10:50 -0600:
The attached patch implements BMI multiple address endpoints that we
talked about some time ago.  To refresh everyone's memory, this
allows a set of addresses to be specified:

tcp://host1:3334/pvfs2-fs,tcp://host2:3335/pvfs2-fs,tcp::// host3:3336/
pvfs2-fs

In the config file for a given storage endpoint.  The BMI code
manages the endpoint, setting the currently used address to the first one in the list. On message failure, the endpoint is transitioned to
point to the next address in the list.  This continues in a round-
robin fashion.

This is good stuff.  I'd like to help review it.  Can you do some
trivial things first to make that easier?

1.
+    struct bmi_endpoint_ref_s *newref;
[..]
+ /* haven't seen any of these addresses before, add a new endpoint */
+    newref = malloc(sizeof(struct bmi_addr_ref_s));

Please change all malloc(sizeof()) and memset(,,sizeof()) to use the
variable name, not its type.  The bug you did above (and other
places) is way too common.  We just have to stop doing that.  Like
this instead:

        newref = malloc(sizeof(*newref));

2.
There's a bunch of stuff that seems out of place.  Can you check
that in or push it to the side so we can concentrate on the core?
179 kB is a big patch.  :)

The bmi-addr.[ch] files are new files, and the critical part of the patch. They obsolete the ref_st calls, so the reference-list. [ch] files have been removed in the patch. That may be adding to the number of lines.


Renaming all the method_ops is a big uninteresting part.

Yeah, I got tired of calling them over and over with the BMI_method_ tagged on the front. I can try to pull that stuff out.


Adding bmi_ in front of everything too, but it's too hard to rip
that out mid-patch now.  Random whitespace fixes too.  Good changes,
just hard to read.

3.
Some trivial bugs.

+    ref->current = ref->current + 1 % ref->count;

Check your precedence table.

+ * (C) 2001 Clemson University and The University of Chicago

New files go back in time.

Can you put some comments above each of the three new structs in
bmi-addr.c?  I keep getting confused on which is the old-style
addr and which is the comma-separated list.  And what the various
"link" and "refs" fields point to.

Sure thing.


I've done some basic testing, but there's still more to do.  The
client IO state machine is a bear, and testing all the cases where
things could failover (requests, flows, acks, etc.) is going to take some more work. I wanted to get the patch out there to allow others
to provide feedback.

Yeah, totally.  But it can be made to work.

What did we decide do with mixed method usage?  The old semantic
was "ib://foo:2345/pvfs2-fs,tcp://foo:2347/pvfs2-fs" means try to
use IB, but if you don't have an IB nic, switch to TCP.  I agree we
decided that was less interesting.  Do we just add docs that say
that this comma is now for multi-pathing?  If people try this
example with the new code, it will flip from IB to TCP at every
timeout.  The old behavior was to stick with the first one where
you had the hardware.  In other words, probably some docs somewhere
should be added to this patch.

Right, I thought we had decided to go the multi-path route. I guess there could be a config option that would set the flipping from round-robin to try-once, giving the old behavior.


On the server side, the ,-separated addresses mean "listen on all
these interfaces".  What do servers do now when they see your
tcp://host1,tcp://host2 example string above?  Looks like they would
fail to listen on anything (host1 can't bind to host2 address?).
This has to be in the fs.conf as the Alias string for each server so
that clients can find the IO servers, not just in the pvfs2tab to
find the config server.

Yeah that formatting isn't going to work. I wanted to keep it simple, but I guess that's not possible. Should we separate fallback addresses with ';' instead?

tcp://host1:3331,ib://host1:3335;tcp://host2:3332,ib://host2:3336

Something like that?
-sam


                -- Pete




_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Reply via email to