> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: Tuesday, January 08, 2008 4:32 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] SDP support for OPEN-MPI
>
> On Jan 8, 2008, at 7:45 AM, Lenny Verkhovsky wrote:
>
> >> Hence, if HAVE_DECL_AF_INET_SDP==1 and using AF_INET_SDP fails to
> >> that peer, it might be desirable to try to fail over to using
> >> AF_INET_something_else.  I'm still technically on vacation :-), so I
> >> didn't look *too* closely at your patch, but I think you're doing
> >> that (failing over if AF_INET_SDP doesn't work because of
> >> EAFNOSUPPORT), which is good.
> >
> > This is actually not implemented yet.  Supporting failover requires
> > opening AF_INET sockets in addition to SDP sockets, which can cause a
> > problem in large clusters.
>
> What I meant was: try to open an SDP socket.  If it fails because SDP
> is not supported / available to that peer, then open a regular
> socket.  So you should still always have only 1 socket open to a peer
> (not 2).

Yes, but since the listener side doesn't know on which socket to expect a
message, it will need both sockets to be open.
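
For concreteness, a minimal connect-side sketch of the failover being
discussed (illustrative only, not the actual patch; the AF_INET_SDP
fallback value of 27 and the exact errno checks are assumptions).  As
noted above, it does not help the listener side, which would still need
both an SDP and a plain TCP listening socket:

#include <errno.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* assumed fallback; the value is not standardized */
#endif

/* Try SDP first; if this kernel has no SDP support, fall back to TCP. */
static int open_stream_socket(void)
{
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0 && (errno == EAFNOSUPPORT || errno == EPROTONOSUPPORT)) {
        fd = socket(AF_INET, SOCK_STREAM, 0);   /* regular IPv4 socket */
    }
    return fd;   /* the caller still fills in a sockaddr_in and connects */
}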
>
> > If one of the machines does not support SDP, the user will get an error.
>
> Well, that's one way to go, but it's certainly less friendly.  It
> means that the entire MPI job has to support SDP -- including mpirun.
> What about clusters that do not have IB on the head node?

They can use the OOB over IP sockets and the BTL over SDP; it should work.

> >> Perhaps a more general approach would be to [perhaps additionally]
> >> provide an MCA param to allow the user to specify the AF_* value?
> >> (AF_INET_SDP is a standardized value, right?  I.e., will it be the
> >> same on all Linux variants [and someday Solaris]?)
> >
> > I didn't find any standard for it; it seems to have been "randomly"
> > selected, since originally it was 26 and was changed to 27 due to a
> > conflict with the kernel's defines.
>
> This might make an even stronger case for having an MCA param for it
> -- if the AF_INET_SDP value is so broken that it's effectively random,
> it may be necessary to override it on some platforms (especially in
> light of binary OMPI and OFED distributions that may not match).

If we are talking about passing only the AF_INET_SDP value, then:
1. Passing this value as an MCA parameter will not change the SDP code.
2. Hopefully, in the future, the AF_INET_SDP value can be obtained from
   libc and will be configured automatically.
3. If we are talking about the AF value in general (IPv4, IPv6, SDP), then
   by fixing it with an MCA parameter we limit ourselves to one protocol
   only, without being able to fail over or to use different protocols for
   different needs (i.e. SDP for the OOB and IPv4 for the BTL).

> >> Patrick's got a good point: is there a reason not to do this?
> >> (LD_PRELOAD and the like)  Is it problematic with the remote orted's?
> >
> > Yes, it's problematic with remote orted's, and it is not really as
> > transparent as you might think, since we can't pass environment
> > variables to the orted's during runtime.
>
> I think this depends on your environment.  If you're not using rsh
> (which you shouldn't be for a large cluster, which is where SDP would
> matter most, right?), the resource manager typically copies the
> environment out to the cluster nodes.  So an LD_PRELOAD value should
> be set for the orteds as well.
>
> I agree that it's problematic for rsh, but that might also be solvable
> (with some limits; there's only so many characters that we can pass on
> the command line -- we did investigate having a wrapper to the orted
> at one point to accept environment variables and then launch the
> orted, but this was so problematic / klunky that we abandoned the idea).

Using LD_PRELOAD will not allow us to use SDP and IP separately, i.e. SDP
for the OOB and IP for a BTL.
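
To illustrate that last point with a hypothetical sketch (the switches
below are invented for illustration and are not existing OMPI MCA
parameters): with support inside the library, each component can choose
its own address family, whereas a global LD_PRELOAD of libsdp redirects
every AF_INET socket in the process.

#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27              /* assumed; the value is not standardized */
#endif

/* Hypothetical per-component switches (e.g. exposed as MCA parameters);
 * these names do not exist in OMPI and are for illustration only. */
static int oob_use_sdp = 1;         /* OOB connections over SDP */
static int btl_tcp_use_sdp = 0;     /* TCP BTL over plain IPv4  */

static int component_socket(int use_sdp)
{
    /* Each component picks its own address family; an LD_PRELOAD of
     * libsdp would instead redirect every AF_INET socket in the process. */
    return socket(use_sdp ? AF_INET_SDP : AF_INET, SOCK_STREAM, 0);
}

For example, the OOB could call component_socket(oob_use_sdp) while the
TCP BTL calls component_socket(btl_tcp_use_sdp).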
> > We must preload the SDP library into each remote environment (i.e.
> > bashrc).  This will cause all applications to use SDP instead of
> > AF_INET, which means you can't choose a specific protocol for a
> > specific application; either you use SDP or AF_INET for all.
> > SDP can also be loaded with an appropriate
> > /usr/local/ofed/etc/libsdp.conf configuration, but a simple user
> > usually has no access to it.
> > (http://www.cisco.com/univercd/cc/td/doc/product/svbu/ofed/ofed_1_1/ofed_ug/sdp.htm#wp952927)
>
> >> Andrew's got a point here, too -- accelerating the TCP BTL with
> >> SDP seems kinda pointless.  I'm guessing that you did it because it
> >> was just about the same work as was done in the TCP OOB (for which we
> >> have no corresponding verbs interface).  Is that right?
> >
> > Indeed.  But it also seems that SDP has lower overhead than VERBS in
> > some cases.
>
> Are you referring to the fact that the avail(%) column is lower for
> verbs than SDP/IPoIB?  That seems like a pretty weird metric for such
> small message counts.  What exactly does 77.5% of 0 bytes mean?
>
> My $0.02 is that the other columns are more compelling.  :-)
>
> > Tests with Sandia's overlapping benchmark
> > http://www.cs.sandia.gov/smb/overhead.html#mozTocId316713
> >
> > VERBS results
> > msgsize  iterations   iter_t   work_t  overhead   base_t  avail(%)
> >       0        1000   16.892   15.309     1.583    7.029      77.5
> >       2        1000   16.852   15.332     1.520    7.144      78.7
> >       4        1000   16.932   15.312     1.620    7.128      77.3
> >       8        1000   16.985   15.319     1.666    7.182      76.8
> >      16        1000   16.886   15.297     1.589    7.219      78.0
> >      32        1000   16.988   15.311     1.677    7.251      76.9
> >      64        1000   16.944   15.299     1.645    7.457      77.9
> >
> > SDP results
> > msgsize  iterations   iter_t   work_t  overhead   base_t  avail(%)
> >       0        1000  134.902  128.089     6.813   54.691      87.5
> >       2        1000  135.064  128.196     6.868   55.283      87.6
> >       4        1000  135.031  128.356     6.675   55.039      87.9
> >       8        1000  130.460  125.908     4.552   52.010      91.2
> >      16        1000  135.432  128.694     6.738   55.615      87.9
> >      32        1000  135.228  128.494     6.734   55.627      87.9
> >      64        1000  135.470  128.540     6.930   56.583      87.8
> >
> > IPoIB results
> > msgsize  iterations   iter_t   work_t  overhead   base_t  avail(%)
> >       0        1000  252.953  247.053     5.900  119.977      95.1
> >       2        1000  253.336  247.285     6.051  121.573      95.0
> >       4        1000  254.147  247.041     7.106  122.110      94.2
> >       8        1000  254.613  248.011     6.602  121.840      94.6
> >      16        1000  255.662  247.952     7.710  124.738      93.8
> >      32        1000  255.569  248.057     7.512  127.095      94.1
> >      64        1000  255.867  248.308     7.559  132.858      94.3
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel