Sam, Just a random thought.. Just like our distribution code is pluggable, perhaps the multi-pathing code can also be (eventually?) moved into a plugin-based architecture..? Also I think, instead of using the secondary paths only for failover scenarios, multi-pathing plugins could define to choose low latency networks for MD traffic and high/BW networks for data or something to that effect. Instead of hard-coding policies such as these, we could probably provide the framework and possibly a default plugin (failover and failback etc) and leave the possibility of 3rd party/external researchers to write their own plugin..
I havent reviewed your patches yet. I plan to do that tomorrow to get a better feel for the interfaces and how we can separate out the policies from the framework mechanisms..So right now, what I have said above is complete vapor-ware and may make no sense :) I guess I was thinking of multi-pathing plugins in the SAN world provided by storage array vendors (e.g..PowerPath for EMC) and was intrigued by the possibility of providing multiple paths across different networks/interconnects .. Thanks, Murali On Nov 7, 2007 2:21 PM, Sam Lang <[EMAIL PROTECTED]> wrote: > > I discussed the desired behavior we want out of this fail-over code > with folks offline, and we came up with a plan. > > At the moment, there are two conflicting failure schemes, multiple > addresses, and multiple protocols. Also, the multiple protocols case > isn't ideal for failure, since both protocols on the same host are > active and listening, making a timeout with one protocol > automatically switch to the second (often slower) one, without any > admin fail-over procedures. > > The plan we have is to allow multiple addresses to be specified in an > ordered list as before: > > Alias hosta mx://hosta:0:3 mx://hostb:0:4 ... tcp://hosta:3334 tcp:// > hostb:3335 .... > > For the server, a command line option specifies which addresses from > this list to listen on. > > pvfs2-server fs.conf -a hosta -e "mx://hosta:0:3" -e "tcp://hosta:3334" > > For the client, the contact order is based on the addresses in the > Alias from left to right, with a few caveats: > > 1. Once an address fails, the client attempts the next addresses in > the list till one succeeds in a round-robin fashion. Once an attempt > succeeds, the cursor is reset to the beginning of the list. This > allows fail-over to a server to be 'reset' at some point, without > getting stuck attempting to contact (and succeeding) on a different > protocol. > > 2. An option to the mount entry specifies which protocols should be > "filtered out" of the list. This allows clients to control the > secondary endpoints that are attempted from the list. The order > remains the same, but for example, nodes that only want to use mx can > set a mount option to bmi=mx, and the list of endpoints in the above > example just becomes: > > Alias hosta mx://hosta:0:3 mx://hostb:0:4 ... > > This prevents the behavior where a client would connect over tcp > instead of attempting mx addresses until the HA infrastructure has > time to do the proper fail-over. In some cases admins may choose to > interleave protocol addresses: > > Alias hosta mx://hosta:0:3 tcp://hosta:3334 mx://hostb:0:4 tcp:// > hostb:3335 .... > > The default behavior here would be to fail-over to the tcp address if > the mx address failed, which the admin may actually want. With the > mount option, the user/admin can then further constrain the list to > be only the mx addresses (or tcp), and allow the fail-over to only > occur on one protocol. Also, the mount option can specify an > ordering to the protocols as well, so that even if the different > addresses of the protocols are interleaved in the list, the mx > addresses will always be attempted first, then the tcp addresses. > > Hopefully this covers all the scenarios we plan to see. At the > moment I'm considering the mount option as bmi=<proto1>:<proto2>:... > > I'm going to start modifying the patch to get this behavior. Let me > know what you think. > > Thanks, > -sam > > > > On Nov 6, 2007, at 6:18 PM, Sam Lang wrote: > > > > > Here's take2. Hopefully a little cleaner. > > > > The more I think about the issue with multiple protocols and > > primary/secondary addresses, the more complicated it gets. I've > > added an option to the server to specify the index in the set of > > primary/secondary addresses to listen on, allowing a server to be > > started and listen on the 3rd set of addresses in the endpoint > > string. This doesn't fix the problem on the client though, as we > > don't really want to iterate over both the primary/secondary > > endpoints and the different protocols. I guess maybe the protocol > > on the client has to be chosen based on the protocol specified in > > the mntent. > > > > -sam > > > > <bmi-maddrs-take2.patch> > > > > > On Nov 6, 2007, at 2:27 PM, Sam Lang wrote: > > > >> > >> On Nov 6, 2007, at 1:41 PM, Pete Wyckoff wrote: > >> > >>> [EMAIL PROTECTED] wrote on Tue, 06 Nov 2007 10:50 -0600: > >>>> The attached patch implements BMI multiple address endpoints > >>>> that we > >>>> talked about some time ago. To refresh everyone's memory, this > >>>> allows a set of addresses to be specified: > >>>> > >>>> tcp://host1:3334/pvfs2-fs,tcp://host2:3335/pvfs2-fs,tcp::// > >>>> host3:3336/ > >>>> pvfs2-fs > >>>> > >>>> In the config file for a given storage endpoint. The BMI code > >>>> manages the endpoint, setting the currently used address to the > >>>> first > >>>> one in the list. On message failure, the endpoint is > >>>> transitioned to > >>>> point to the next address in the list. This continues in a round- > >>>> robin fashion. > >>> > >>> This is good stuff. I'd like to help review it. Can you do some > >>> trivial things first to make that easier? > >>> > >>> 1. > >>>> + struct bmi_endpoint_ref_s *newref; > >>> [..] > >>>> + /* haven't seen any of these addresses before, add a new > >>>> endpoint */ > >>>> + newref = malloc(sizeof(struct bmi_addr_ref_s)); > >>> > >>> Please change all malloc(sizeof()) and memset(,,sizeof()) to use the > >>> variable name, not its type. The bug you did above (and other > >>> places) is way too common. We just have to stop doing that. Like > >>> this instead: > >>> > >>> newref = malloc(sizeof(*newref)); > >>> > >>> 2. > >>> There's a bunch of stuff that seems out of place. Can you check > >>> that in or push it to the side so we can concentrate on the core? > >>> 179 kB is a big patch. :) > >> > >> The bmi-addr.[ch] files are new files, and the critical part of > >> the patch. They obsolete the ref_st calls, so the reference-list. > >> [ch] files have been removed in the patch. That may be adding to > >> the number of lines. > >> > >>> > >>> Renaming all the method_ops is a big uninteresting part. > >> > >> Yeah, I got tired of calling them over and over with the > >> BMI_method_ tagged on the front. I can try to pull that stuff out. > >> > >>> > >>> Adding bmi_ in front of everything too, but it's too hard to rip > >>> that out mid-patch now. Random whitespace fixes too. Good changes, > >>> just hard to read. > >>> > >>> 3. > >>> Some trivial bugs. > >>> > >>>> + ref->current = ref->current + 1 % ref->count; > >>> > >>> Check your precedence table. > >>> > >>>> + * (C) 2001 Clemson University and The University of Chicago > >>> > >>> New files go back in time. > >>> > >>> Can you put some comments above each of the three new structs in > >>> bmi-addr.c? I keep getting confused on which is the old-style > >>> addr and which is the comma-separated list. And what the various > >>> "link" and "refs" fields point to. > >> > >> Sure thing. > >> > >>> > >>>> I've done some basic testing, but there's still more to do. The > >>>> client IO state machine is a bear, and testing all the cases where > >>>> things could failover (requests, flows, acks, etc.) is going to > >>>> take > >>>> some more work. I wanted to get the patch out there to allow > >>>> others > >>>> to provide feedback. > >>> > >>> Yeah, totally. But it can be made to work. > >>> > >>> What did we decide do with mixed method usage? The old semantic > >>> was "ib://foo:2345/pvfs2-fs,tcp://foo:2347/pvfs2-fs" means try to > >>> use IB, but if you don't have an IB nic, switch to TCP. I agree we > >>> decided that was less interesting. Do we just add docs that say > >>> that this comma is now for multi-pathing? If people try this > >>> example with the new code, it will flip from IB to TCP at every > >>> timeout. The old behavior was to stick with the first one where > >>> you had the hardware. In other words, probably some docs somewhere > >>> should be added to this patch. > >> > >> Right, I thought we had decided to go the multi-path route. I > >> guess there could be a config option that would set the flipping > >> from round-robin to try-once, giving the old behavior. > >> > >>> > >>> On the server side, the ,-separated addresses mean "listen on all > >>> these interfaces". What do servers do now when they see your > >>> tcp://host1,tcp://host2 example string above? Looks like they would > >>> fail to listen on anything (host1 can't bind to host2 address?). > >>> This has to be in the fs.conf as the Alias string for each server so > >>> that clients can find the IO servers, not just in the pvfs2tab to > >>> find the config server. > >> > >> Yeah that formatting isn't going to work. I wanted to keep it > >> simple, but I guess that's not possible. Should we separate > >> fallback addresses with ';' instead? > >> > >> tcp://host1:3331,ib://host1:3335;tcp://host2:3332,ib://host2:3336 > >> > >> Something like that? > >> -sam > >> > >>> > >>> -- Pete > >>> > >> > > > > _______________________________________________ > Pvfs2-developers mailing list > [email protected] > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers > _______________________________________________ Pvfs2-developers mailing list [email protected] http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
