All true - and yet, becoming more common in larger clusters :-/

On Mar 24, 2014, at 7:42 AM, Kenneth A. Lloyd <kenneth.ll...@wattsys.com> wrote:
> Vasily,
>
> The problem you've identified of differing kernel versions is exacerbated by
> also computing across hybrid, heterogeneous hardware architectures (i.e.
> SMP & NUMA, different streaming processor architectures, or different shared
> memory architectures).
>
> ==========================
> Kenneth A. Lloyd, Jr.
> CEO - Director, Systems Science
> Watt Systems Technologies Inc.
> Albuquerque, NM USA
> www.wattsys.com
> kenneth.ll...@wattsys.com
>
> This e-mail is covered by the Electronic Communications Privacy Act, 18
> U.S.C. 2510-2521, and is intended only for the addressee named above. It may
> contain privileged or confidential information. If you are not the addressee
> you must not copy, distribute, disclose or use any of the information in
> this transmission. If you received it in error, please delete it and
> immediately notify the sender.
>
> -----Original Message-----
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Vasily Filipov
> Sent: Monday, March 24, 2014 7:44 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] autoconf warnings: openib BTL
>
> Actually, I think if you build your job with one kernel version and run it on
> nodes that have another version, rdmacm will be the smallest of your problems.
> Anyway, here is the revision that fixes the issue.
>
> ------------------------------------------------------------------------
> r31194 | vasily | 2014-03-24 15:36:04 +0200 (Mon, 24 Mar 2014) | 3 lines
>
> BTL/OPENIB: remove AC_RUN_IFELSE from configure and check AF_IB support by
> lib rdmacm during component_init.
>
> ------------------------------------------------------------------------
>
> Thank you,
> Vasily.
>
> On 13-Mar-14 15:44, Ralph Castain wrote:
>> I think the critical point is this one:
>>
>>> To be clear: whether AF_IB works or not is a determination to make on the
>>> machines on which you *run* -- NOT on the machine on which you *build*.
>> Many of our users compile on the frontend node of their cluster, which
>> doesn't even have an IB NIC installed (they only have the libraries present
>> so it can compile). You need to test this at run time to ensure you are on a
>> machine where someone actually is able to run rdmacm.
>>
>>
>> On Mar 13, 2014, at 5:53 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>
>>> On Mar 13, 2014, at 4:59 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>
>>>>>>> Right? If so, I don't see why you need the AC_TRY_RUN -- if RDMACM
>>>>>>> is easily detectable as to which way it is compiled (because it has, for
>>>>>>> example, different fields), then AC_CHECK_DECLS should be enough, right?
>>>> RDMACM API has different implementation requirements for its providers:
>>>> tcp, af_ib (different structs/fields should be used/passed. different
>>>> APIs/hooks should be called for bring-up).
>>> Yes, this was said before. Which is why I don't understand why
>>> AC_CHECK_DECLS isn't enough -- it's a compile-time check, right?
>>>
>>> Let me get this straight:
>>>
>>> 1. AF_IB may or may not be present.
>>> 2. If AF_IB is present, it may or may not work (i.e., support for AF_IB
>>>    may or may not work in the kernel).
>>> 3. If AF_IB is present, you can only compile with the AF_IB fields and
>>>    methods.
>>> 4. If AF_IB is not present, you can only compile with the non-AF_IB
>>>    fields and methods.
>>>
>>> I think #2 is not relevant for configure -- only #1, #3, and #4 are
>>> relevant.
>>> So you should have code something like this:
>>>
>>> #if HAVE_DECL_AF_IB
>>>     ret = do_the_stuff_with_af_ib(...);
>>>     if (OMPI_SUCCESS != ret) {
>>>         opal_show_help(...AF_IB doesn't work...);
>>>         return ret;
>>>     }
>>> #else
>>>     ret = do_the_stuff_without_af_ib(...);
>>>     if (OMPI_SUCCESS != ret) {
>>>         opal_show_help(...non-AF_IB doesn't work...);
>>>         return ret;
>>>     }
>>> #endif
>>>
>>> To be clear: whether AF_IB works or not is a determination to make on the
>>> machines on which you *run* -- NOT on the machine on which you *build*.
>>>
>>> This is one of the key reasons that OMPI prefers run-time detection for
>>> run-time characteristics over configure-time detection for run-time
>>> characteristics (because you may run OMPI on different machines than where
>>> you built OMPI).
>>>
>>>> Currently, the RDMACM provider can be selected at compile time only and
>>>> mpirun becomes incompatible to other RDMACM providers.
>>> What does mpirun have to do with this? We're talking about the openib
>>> BTL, right?
>>>
>>>> AC_TRY_RUN is a protection that selected provider will be able to
>>>> run,otherwise no fallback to other provider will be available for user at
>>>> runtime.
>>> I can't parse this statement...?
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/03/14342.php
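
For reference, a minimal C sketch of the pattern discussed in this thread: the declaration check stays at build time, while the question of whether AF_IB actually works is asked when the component initializes on the node where the job runs. Everything named here (`component_init_sketch`, the `probe_*` stubs, the `SKETCH_*` codes) is hypothetical and is not Open MPI or librdmacm API; the stubs only stand in for whatever real rdmacm calls a component would make. In real code the branch would be selected with `#if HAVE_DECL_AF_IB`, since only one set of fields compiles; in this sketch it is passed as an argument so that both branches can be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical return codes for this sketch (not Open MPI constants). */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_AVAILABLE = -1 };

/* A run-time probe: returns SKETCH_SUCCESS if the path works on this node. */
typedef int (*probe_fn)(void);

/* Hypothetical stubs standing in for real probes of the rdmacm provider. */
static int probe_ok_stub(void)    { return SKETCH_SUCCESS; }
static int probe_fails_stub(void) { return SKETCH_ERR_NOT_AVAILABLE; }

/*
 * Run at component-init time, on the machine where the job runs.
 * have_decl_af_ib mirrors the build-time result (assumed to come from a
 * declaration check such as AC_CHECK_DECLS); the probes answer the
 * run-time question of whether the selected path actually works here.
 */
static int component_init_sketch(int have_decl_af_ib,
                                 probe_fn probe_af_ib,
                                 probe_fn probe_non_af_ib)
{
    probe_fn probe = have_decl_af_ib ? probe_af_ib : probe_non_af_ib;
    const char *path = have_decl_af_ib ? "AF_IB" : "non-AF_IB";
    int ret;

    assert(NULL != probe);

    ret = probe();
    if (SKETCH_SUCCESS != ret) {
        /* Report and let the caller disable the component on this node. */
        fprintf(stderr, "%s path does not work on this node\n", path);
    }
    return ret;
}
```

For example, `component_init_sketch(1, probe_fails_stub, probe_ok_stub)` returns `SKETCH_ERR_NOT_AVAILABLE`: the build saw the AF_IB declaration, but the node it runs on cannot use it, so the failure surfaces at init time rather than at configure time.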