If your offer is between quadratic and non-deterministic, I'll take the former.

I would advocate for a middle-ground solution. Clearly document in the header 
file that ompi_proc_get_hostname is __not__ safe to use in all contexts, as 
it might exhibit recursive behavior due to communication. Then revert all of 
its uses in the context of opal_output, opal_output_verbose, and all variants 
back to using "->proc_hostname". We might get a "(null)" instead of the peer 
name, but this removes the potential loops.
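
For illustration, the header note and a reverted call site could look 
roughly like this (a sketch only; the output id and the NULL guard are 
placeholders, not actual code):

    /* In the header, e.g. ompi/proc/proc.h:
     *
     * NOTE: ompi_proc_get_hostname() is __not__ safe in all contexts.
     * The first call for a given peer may trigger a modex/db fetch,
     * i.e. communication, so it must not be called from within
     * opal_output/opal_output_verbose paths or anything reachable
     * from the progress engine.
     */

    /* Call sites in output paths use the raw field; worst case we
     * print "(null)" instead of the peer name: */
    opal_output_verbose(10, output_id,   /* output_id: placeholder */
                        "peer %s unreachable",
                        (NULL == proc->proc_hostname) ?
                            "(null)" : proc->proc_hostname);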

  George.

On Aug 19, 2013, at 23:52, Nathan Hjelm <hje...@lanl.gov> wrote:

> It would require a db read from every rank, which is what we are trying
> to avoid: with each of N ranks issuing O(N) gets, the total work is
> O(N^2). This scales quadratically at best on Cray systems.
> 
> -Nathan
> 
> On Mon, Aug 19, 2013 at 02:48:18PM -0700, Ralph Castain wrote:
>> Yeah, I have some concerns about it too...been trying to test it out some 
>> more. Would be good to see just how much that one change makes - maybe 
>> restoring just the hostname wouldn't have that big an impact.
>> 
>> I'm leery of trying to ensure we've stripped out all the opal_output loops 
>> for the case where we don't find the hostname.
>> 
>> On Aug 19, 2013, at 2:41 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>>> As a result of this patch, the first decode of a peer's host name might 
>>> happen in the middle of a debug message (on the first call to 
>>> ompi_proc_get_hostname). Such behavior might generate deadlocks, depending 
>>> on the level of output verbosity, and has significant potential to 
>>> reintroduce the recursive behavior the new state machine was supposed to 
>>> remove.
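>>> 
>>> To make the hazard concrete (a sketch, not an actual code path; the 
>>> output id is a placeholder):
>>> 
>>>    /* The verbose call itself can trigger communication on the
>>>     * first lookup of a peer's name: */
>>>    opal_output_verbose(50, output_id,  /* output_id: placeholder */
>>>                        "sending to %s",
>>>                        ompi_proc_get_hostname(proc));
>>>    /* first call -> modex/PMI fetch for the hostname
>>>     * -> progresses the communication engine
>>>     * -> which may emit its own verbose output and re-enter here,
>>>     *    recursing or deadlocking depending on the verbosity level */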
>>> 
>>> George.
>>> 
>>> 
>>> On Aug 17, 2013, at 02:49, svn-commit-mai...@open-mpi.org wrote:
>>> 
>>>> Author: rhc (Ralph Castain)
>>>> Date: 2013-08-16 20:49:18 EDT (Fri, 16 Aug 2013)
>>>> New Revision: 29040
>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29040
>>>> 
>>>> Log:
>>>> When we direct-launch an application, we rely on PMI for wireup support. 
>>>> In doing so, we lose the de facto data compression we get from the ORTE 
>>>> modex, since we no longer get all the wireup info from every proc in a 
>>>> single blob. Instead, we have to iterate over all the procs, calling 
>>>> PMI_KVS_Get for every value we require.
>>>> 
>>>> This creates really bad scaling behavior. Users have found a nearly 20% 
>>>> launch-time differential between mpirun and PMI, with PMI being the 
>>>> slower method. Some of the problem is attributable to poor exchange 
>>>> algorithms in RMs like Slurm and Alps, but we make things worse by 
>>>> calling "get" so many times.
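>>>> 
>>>> To illustrate the pattern being avoided (key names and loop bounds are 
>>>> hypothetical, not the actual ones):
>>>> 
>>>>    /* Every rank walks every other rank, issuing one get per key it
>>>>     * needs; N ranks each doing O(N) gets is O(N^2) work overall.
>>>>     * nprocs, nkeys, and kvsname come from PMI initialization. */
>>>>    char value[1024];  /* sized via PMI_KVS_Get_value_length_max() */
>>>>    for (int rank = 0; rank < nprocs; rank++) {
>>>>        for (int k = 0; k < nkeys; k++) {
>>>>            char key[64];
>>>>            snprintf(key, sizeof(key), "proc.%d.key.%d", rank, k);
>>>>            PMI_KVS_Get(kvsname, key, value, sizeof(value));
>>>>        }
>>>>    }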
>>>> 
>>>> Nathan (with a tad of advice from me) has attempted to alleviate this 
>>>> problem by reducing the number of "get" calls. This required the 
>>>> following changes:
>>>> 
>>>> * upon first request for data, have the OPAL db pmi component fetch and 
>>>> decode *all* the info from a given remote proc. It turned out we weren't 
>>>> caching the info, so we would continually re-request it and decode only 
>>>> the piece we needed for the immediate request. We now decode all the info 
>>>> and push it into the db hash component for local storage - and then all 
>>>> subsequent retrievals are fulfilled locally.
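>>>> 
>>>> In pseudo-code, the new fetch path is roughly as follows (all names here 
>>>> are illustrative, not the actual component API):
>>>> 
>>>>    /* First fetch for a peer pulls and decodes the whole blob once,
>>>>     * stores every key in the local hash, then answers from cache. */
>>>>    static int db_pmi_fetch(int peer, const char *key, void **data)
>>>>    {
>>>>        if (!peer_cached[peer]) {             /* hypothetical flag  */
>>>>            blob_t blob = pmi_get_all(peer);  /* one decode per peer */
>>>>            for (kv_t *kv = decode_next(&blob); NULL != kv;
>>>>                 kv = decode_next(&blob)) {
>>>>                db_hash_store(peer, kv->key, kv->value);
>>>>            }
>>>>            peer_cached[peer] = 1;
>>>>        }
>>>>        return db_hash_fetch(peer, key, data); /* local from now on */
>>>>    }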
>>>> 
>>>> * reduced the amount of data by eliminating the exchange of the OMPI_ARCH 
>>>> value if heterogeneity is not enabled. This was used solely as a check so 
>>>> we would error out if the system wasn't actually homogeneous, which was 
>>>> fine when we thought there was no cost in doing the check. Unfortunately, 
>>>> at large scale and with direct launch, there is a non-zero cost to 
>>>> performing this test. We are open to finding a compromise (perhaps 
>>>> turning the test off if requested?) if people feel strongly about 
>>>> keeping it.
>>>> 
>>>> * reduced the amount of RTE data being automatically fetched, fetching 
>>>> the rest only upon request. In particular, we no longer immediately fetch 
>>>> the hostname (which is only used for error reporting), but instead get it 
>>>> when needed. Likewise for the RML URI, as that info is only required in 
>>>> some (not all) environments. In addition, we no longer fetch the locality 
>>>> unless required, relying instead on the PMI clique info to tell us who is 
>>>> on our local node (if additional info is required, the fetch is performed 
>>>> when a modex_recv is issued).
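>>>> 
>>>> A sketch of the lazy hostname path (the db helper name is hypothetical):
>>>> 
>>>>    const char *ompi_proc_get_hostname(ompi_proc_t *proc)
>>>>    {
>>>>        if (NULL == proc->proc_hostname) {
>>>>            /* first use: fetch from the db (may communicate) */
>>>>            (void) db_fetch_hostname(proc); /* hypothetical helper */
>>>>        }
>>>>        return proc->proc_hostname;
>>>>    }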
>>>> 
>>>> Again, all of this only impacts direct launch - all the info is provided 
>>>> when launched via mpirun, as there is no added cost to getting it.
>>>> 
>>>> Barring objections, we may move this (plus any other required pieces) to 
>>>> the 1.7 branch once it soaks for an appropriate time.
>>> 
