Hi Gilles

We discussed this at the devel conference this morning. The root cause of
the problem is a test in coll/ml that we feel is incorrect - it basically
checks to see if the proc itself is bound, and then assumes that all other
procs are similarly bound. This in fact is never guaranteed to be true as
someone could use the rank_file method to specify that some procs are to be
left unbound, while others are to be bound to specified cpus.
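
To make the failure mode concrete, here is a minimal sketch, not the actual coll/ml code. The `proc_info_t` type and both function names are hypothetical and exist only for this illustration. It models a job where a rank file binds some procs and leaves others unbound, and contrasts the shortcut (infer everyone's state from our own) with the condition that would really have to hold.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process record; this is not an Open MPI type. */
typedef struct {
    int  rank;
    bool bound;   /* true if the launcher bound this proc to specific cpus */
} proc_info_t;

/* The questionable shortcut: infer every proc's state from our own. */
static bool assume_all_bound(const proc_info_t *self)
{
    return self->bound;
}

/* What would actually have to hold for that inference to be safe. */
static bool really_all_bound(const proc_info_t *procs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!procs[i].bound) {
            return false;
        }
    }
    return true;
}
```

With three procs where only rank 1 is left unbound, rank 0 would conclude from its own binding that everyone is bound, while the full scan says otherwise.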

Nathan has looked at that check before and believes it isn't necessary. All
coll/ml really needs to know is that the two procs share the same node, and
the current locality algorithm will provide that information. We have asked
him to "fix" the coll/ml selection logic to resolve that situation.
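
As a rough illustration of the "same node is enough" test, here is a sketch that assumes the locality value is a bitmask with one bit per shared level. The `LOC_*` names and their values are made up for this example and are not the OPAL definitions. The point is that the node-level bit can be checked without looking at how either proc is bound.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical locality bits; names and values are illustrative only. */
#define LOC_ON_NODE    0x01u
#define LOC_ON_SOCKET  0x02u
#define LOC_ON_L3CACHE 0x04u

/* True if the two procs described by this locality value share a node,
 * whatever finer-grained levels they may or may not also share. */
static bool shares_node(uint16_t locality)
{
    return (locality & LOC_ON_NODE) != 0;
}
```

Under these assumptions, a peer that reports only the node bit (for example because it is unbound) and a peer that also shares a socket both pass the check; only a peer on a different node fails it.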

After discussing the various locality definitions, it was our feeling
that the current definition is probably the better one unless you have a
reason for changing it other than coll/ml. If so, we'd be happy to revisit
the proposal.

Make sense?
Ralph



On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> WHAT: semantic change of opal_hwloc_base_get_relative_locality
>
> WHY:  make it closer to what coll/ml expects.
>
>       Currently, opal_hwloc_base_get_relative_locality means "at what
> level do these procs share cpus"
>       however, coll/ml is using it as "at what level are these procs
> commonly bound".
>
>       it is important to note that if a task is bound to all the available
> cpus, locality should
>       be set to OPAL_PROC_ON_NODE only.
>       /* e.g. on a single socket Sandy Bridge system, use
> OPAL_PROC_ON_NODE instead of OPAL_PROC_ON_L3CACHE */
>
>       This was initially discussed on the devel mailing list
>       http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>
>       as advised by Ralph, i browsed the source code looking for how the
> (ompi_proc_t *)->proc_flags is used.
>       so far, it is mainly used to figure out whether the proc is on the
> same node or not.
>
>       notable exceptions are :
>        a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c :
> OPAL_PROC_ON_LOCAL_SOCKET
>        b) ompi/mca/coll/fca/coll_fca_module.c and
> oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>
>       about a) the new definition fixes a hang in coll/ml
>       about b) FCA_IS_LOCAL_PROCESS looks like legacy code /* i could only
> find OMPI_PROC_FLAG_LOCAL in v1.3 */
>       so this macro can be simply removed and replaced with
> OPAL_PROC_ON_LOCAL_NODE
>
>       at this stage, i cannot find any objection to making the described
> change.
>       please report if any and/or feel free to comment.
>
> WHERE: see the two attached patches
>
> TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago, June
> 24-26.
>          The RFC will become final only after the meeting.
>          /* Ralph already added this topic to the agenda */
>
> Thanks
>
> Gilles
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/15046.php
>
