All true - and yet, becoming more common in larger clusters :-/

On Mar 24, 2014, at 7:42 AM, Kenneth A. Lloyd <kenneth.ll...@wattsys.com> wrote:

> Vasily,
> 
> The problem you've identified of differing kernel versions is exacerbated by
> also computing across hybrid, heterogeneous hardware architectures (e.g.,
> SMP & NUMA, different streaming processor architectures, or different shared
> memory architectures).
> 
> ==========================
> Kenneth A. Lloyd, Jr.
> CEO - Director, Systems Science
> Watt Systems Technologies Inc.
> Albuquerque, NM USA
> www.wattsys.com
> kenneth.ll...@wattsys.com
> 
> 
> 
> 
> -----Original Message-----
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Vasily Filipov
> Sent: Monday, March 24, 2014 7:44 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] autoconf warnings: openib BTL
> 
> Actually, I think that if you build your job with one kernel version and run
> it on nodes that have another version, rdmacm will be the least of your
> problems. Anyway, here is the revision that fixes the issue.
> 
> ------------------------------------------------------------------------
> r31194 | vasily | 2014-03-24 15:36:04 +0200 (Mon, 24 Mar 2014) | 3 lines
> 
> BTL/OPENIB: remove AC_RUN_IFELSE from configure and check AF_IB support by
> lib rdmacm during component_init.
> 
> 
> ------------------------------------------------------------------------
> 
> Thank you,
> Vasily.
> 
> On 13-Mar-14 15:44, Ralph Castain wrote:
>> I think the critical point is this one:
>> 
>>> To be clear: whether AF_IB works or not is a determination to make on the
>>> machines on which you *run* -- NOT on the machine on which you *build*.
>> Many of our users compile on the frontend node of their cluster, which
>> doesn't even have an IB NIC installed (they only have the libraries present
>> so it can compile). You need to test this at run time to ensure you are on a
>> machine where someone actually is able to run rdmacm.
>> 
>> 
>> On Mar 13, 2014, at 5:53 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> 
>>> On Mar 13, 2014, at 4:59 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>> 
>>>>>>> Right?  If so, I don't see why you need the AC_TRY_RUN -- if RDMACM
>>>>>>> is easily detectable as to which way it is compiled (because it has, for
>>>>>>> example, different fields), then AC_CHECK_DECLS should be enough, right?
>>>> RDMACM API has different implementation requirements for its providers:
>>>> tcp, af_ib (different structs/fields should be used/passed; different
>>>> APIs/hooks should be called for bring-up).
>>> Yes, this was said before.  Which is why I don't understand why
>>> AC_CHECK_DECLS isn't enough -- it's a compile-time check, right?
>>> 
>>> Let me get this straight:
>>> 
>>> 1. AF_IB may or may not be present.
>>> 2. If AF_IB is present, it may or may not work (i.e., support for AF_IB
>>>    may or may not work in the kernel).
>>> 3. If AF_IB is present, you can only compile with the AF_IB fields and
>>>    methods.
>>> 4. If AF_IB is not present, you can only compile with the non-AF_IB
>>>    fields and methods.
>>> 
>>> I think #2 is not relevant for configure -- only #1, #3, and #4 are
>>> relevant.  So you should have code something like this:
>>> 
>>> #if HAVE_DECL_AF_IB
>>>    ret = do_the_stuff_with_af_ib(...);
>>>    if (OMPI_SUCCESS != ret) {
>>>        opal_show_help(...AF_IB doesn't work...);
>>>        return ret;
>>>    }
>>> #else
>>>    ret = do_the_stuff_without_af_ib(...);
>>>    if (OMPI_SUCCESS != ret) {
>>>        opal_show_help(...non-AF_IB doesn't work...);
>>>        return ret;
>>>    }
>>> #endif
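
For illustration, here is a minimal self-contained sketch of the pattern above. It relies on one documented Autoconf behavior: AC_CHECK_DECLS always defines HAVE_DECL_<symbol>, to 1 if the symbol was declared at build time and to 0 otherwise, so a plain #if is enough. Everything else is an assumption made for the sketch: the names sketch_rdmacm_setup, setup_with_af_ib, setup_without_af_ib and the SKETCH_* codes are hypothetical and are not part of the openib BTL, and the node_supports_af_ib argument merely simulates what a real bring-up attempt would discover on the node where the job runs.

```c
#include <stdio.h>

/* Normally supplied by config.h via AC_CHECK_DECLS([AF_IB], ...).
 * The fallback below only keeps this sketch compilable on its own. */
#ifndef HAVE_DECL_AF_IB
#define HAVE_DECL_AF_IB 0
#endif

#define SKETCH_SUCCESS 0
#define SKETCH_ERROR   (-1)

#if HAVE_DECL_AF_IB
/* Hypothetical stand-in for the AF_IB bring-up path. It succeeds only
 * if the node it runs on actually supports AF_IB. */
static int setup_with_af_ib(int node_supports_af_ib)
{
    return node_supports_af_ib ? SKETCH_SUCCESS : SKETCH_ERROR;
}
#else
/* Hypothetical stand-in for the non-AF_IB bring-up path. */
static int setup_without_af_ib(void)
{
    return SKETCH_SUCCESS;
}
#endif

/* The build decides WHICH path can be compiled; the return value,
 * computed at run time, decides whether that path WORKS here. */
int sketch_rdmacm_setup(int node_supports_af_ib)
{
    int ret;
#if HAVE_DECL_AF_IB
    ret = setup_with_af_ib(node_supports_af_ib);
    if (SKETCH_SUCCESS != ret) {
        fprintf(stderr, "AF_IB was compiled in but does not work on this node\n");
    }
#else
    (void) node_supports_af_ib;  /* unused on this path */
    ret = setup_without_af_ib();
    if (SKETCH_SUCCESS != ret) {
        fprintf(stderr, "non-AF_IB setup does not work on this node\n");
    }
#endif
    return ret;
}
```

Built without -DHAVE_DECL_AF_IB=1, only the non-AF_IB path exists and the call succeeds regardless of the argument. Built with it, the same call fails cleanly on a node that lacks AF_IB support, which is the run-time outcome that configure cannot know in advance.
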
>>> 
>>> To be clear: whether AF_IB works or not is a determination to make on the
>>> machines on which you *run* -- NOT on the machine on which you *build*.
>>> 
>>> This is one of the key reasons that OMPI prefers run-time detection for
>>> run-time characteristics over configure-time detection for run-time
>>> characteristics (because you may run OMPI on different machines than where
>>> you built OMPI).
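
A rough sketch of what run-time detection with a fallback could look like, under stated assumptions: each provider carries a probe that is executed on the node where the job runs, and the first provider whose probe succeeds is used. The type sketch_provider_t, the function sketch_select_provider, and both probe functions are hypothetical and only simulate an outcome; a real probe would have to attempt the actual bring-up, and none of this is taken from the openib BTL source.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical provider descriptor: a name plus a run-time probe. */
typedef struct {
    const char *name;
    int (*probe)(void);  /* returns 1 if usable on this node, else 0 */
} sketch_provider_t;

/* Placeholder probes that simulate a node without working AF_IB. */
static int probe_af_ib(void) { return 0; }
static int probe_tcp(void)   { return 1; }

/* Candidate providers in order of preference. */
static const sketch_provider_t sketch_providers[] = {
    { "af_ib", probe_af_ib },
    { "tcp",   probe_tcp   },
};

/* Return the first provider whose probe succeeds on this node,
 * or NULL if none of the first n candidates is usable. */
const sketch_provider_t *
sketch_select_provider(const sketch_provider_t *list, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        if (list[i].probe()) {
            return &list[i];
        }
        fprintf(stderr, "%s is not usable on this node, trying next\n",
                list[i].name);
    }
    return NULL;
}
```

Calling sketch_select_provider(sketch_providers, 2) with these placeholder probes skips af_ib and returns the tcp entry. The point of the design is that the decision is taken per node at start-up, so a binary built on a frontend without an IB NIC still behaves sensibly on the compute nodes.
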
>>> 
>>>> Currently, the RDMACM provider can be selected at compile time only and
>>>> mpirun becomes incompatible to other RDMACM providers.
>>> What does mpirun have to do with this?  We're talking about the openib
>>> BTL, right?
>>> 
>>>> AC_TRY_RUN is a protection that selected provider will be able to
>>>> run, otherwise no fallback to other provider will be available for user at
>>>> runtime.
>>> I can't parse this statement...?
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/03/14342.php
>> 
> 
> 
> 
> 
