To me, (c) is a non-starter. I think whenever possible we should be
automatically doing the right thing. The user should not need to have
any idea how things work inside the library.
Between options (a) and (b), I don't really care.
(b) would be great if we had an MCA component dependency system, which has
been much talked about. But without such a system it gets messy.
(a) has the advantage of making sure there are no problems and allowing
the two frameworks to interact very nicely together, but it also might add a
large burden on component writers.
On a related, but slightly different topic, one thing that has always
bothered me about the grpcomm/routed implementation is that it is not
self-contained. There is logic for routing algorithms outside of the
components (for example, in orte/orted/orted_comm.c). So, if there are
any overhauls planned I definitely think this needs to be cleaned up.
Thanks,
Tim
Ralph H Castain wrote:
II. Interaction between the ROUTED and GRPCOMM frameworks
When we initially developed these two frameworks within the RTE, we
envisioned them to operate totally independently of each other. Thus, the
grpcomm collectives provide algorithms such as a binomial "xcast" that uses
the daemons to scalably send messages across the system.
However, we recently realized that the efficacy of the current grpcomm
algorithms directly hinges on the daemons being fully connected - which we
were recently told may not be the case as other people introduce different
ROUTED components. For example, using the binomial algorithm in grpcomm's
xcast while having a ring topology selected in ROUTED would likely result in
terrible performance.
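
To put a rough number on the mismatch described above, here is a small
illustrative model (not the actual grpcomm or ROUTED code). It assumes a
binomial broadcast rooted at daemon 0 and a bidirectional ring in which a
message between non-adjacent daemons is relayed hop by hop. All function
names are made up for this sketch.

```python
def binomial_edges(n):
    """Logical (sender, receiver) pairs of a binomial broadcast rooted at 0.

    In each round, every daemon that already holds the message forwards it
    to the daemon `step` positions above it, and `step` doubles each round.
    """
    edges = []
    step = 1
    while step < n:
        for sender in range(step):
            receiver = sender + step
            if receiver < n:
                edges.append((sender, receiver))
        step *= 2
    return edges


def direct_hops(src, dst, n):
    """Fully connected daemons: every logical send is one physical hop."""
    return 1


def ring_hops(src, dst, n):
    """Bidirectional ring (assumed): relay along the shorter arc."""
    d = abs(dst - src)
    return min(d, n - d)


def total_hops(n, hop_fn):
    """Total physical hops needed to carry out all logical sends."""
    return sum(hop_fn(s, r, n) for s, r in binomial_edges(n))
```

Under these assumptions, with 8 daemons the same 7 logical sends cost 7
hops when the daemons are fully connected and 21 hops when they are
relayed around the ring, and the gap widens as the daemon count grows.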
This raises the following questions:
(a) should the GRPCOMM and ROUTED frameworks be consolidated to ensure that
the group collective algorithms properly "match" the communication
topology?
(b) should we automatically select the grpcomm/routed pairings based on some
internal logic?
(c) should we leave this "as-is" and make the user responsible for making
intelligent choices (and for detecting when performance is bad due to
this mismatch)?
(d) other suggestions?
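
As one way to picture option (b), here is a minimal sketch of "internal
logic" that pairs a routing topology with a collective algorithm. It is
only an illustration: the topology names, algorithm names, and the
select_xcast function are hypothetical and do not correspond to existing
components or MCA parameters.

```python
# Hypothetical topology and algorithm names, for illustration only.
PREFERRED_XCAST = {
    "fully_connected": "binomial",
    "ring": "chain",
    "tree": "follow_tree",
}


def select_xcast(routed_topology, user_choice=None, fallback="direct"):
    """Pick an xcast algorithm that matches the routing topology.

    An explicit user choice always wins. Otherwise the table decides, and
    an unknown topology falls back to a conservative default.
    """
    if user_choice is not None:
        return user_choice
    return PREFERRED_XCAST.get(routed_topology, fallback)
```

A table like this keeps the two frameworks separate, but it has to be
updated whenever someone adds a new ROUTED component, which is where a
component dependency mechanism would help.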
Ralph