I'm not sure I would call (a) "dumb", but I would agree it isn't a desirable option. ;-)

The issue isn't with the current two routed components. It arose because additional routed components are about to be committed to the system. None of those new components is fully connected - i.e., each daemon only has sparse connections to its peers. Hence, the current grpcomm collectives will cause performance problems. Re-writing those collectives to be independent of sparse vs. fully connected schemes is a non-trivial exercise. Do I hear a volunteer out there? ;-)

I could have just left this issue off the list, of course, and let the authors of the new routed components figure out what was wrong and deal with it. But I thought it would be friendlier to raise the point and see if people had any suggestions on how to resolve the issue -before- it rears its head. So, having done so, perhaps the best solution is option (c) - let anyone who brings a new routing scheme into the system deal with the collective mismatch problem.

As for the relaying operations in the orted that Tim noted: including the relay operation in the grpcomm framework would be extremely difficult, although I won't immediately rule it out as "impossible". The problem is that the orted has to actually process the message - it doesn't just route it on to some other destination the way the RML does. Thus, the orted_comm code contains a "relay" function that receives the message, processes it, and then sends it along according to whatever transmission protocol was specified by the sender.

To move that code into grpcomm, the collectives would have to put a flag in the message indicating "intermediaries who are routing this message need to process it first". Grpcomm would then have to include some mechanism for me to indicate "if you are told to process a message first, here is the function you need to call". We would then have to add a mechanism in the RML routing system that looks for this flag and calls the "process it" function before continuing the route.
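To make that concrete, here is a rough sketch of the plumbing I have in mind. Every name below (relay_msg_t, grpcomm_register_relay_cb, rml_route, the flag) is invented for illustration - none of this is existing ORTE API:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag the collectives would set in the message header. */
#define MSG_FLAG_PROCESS_AT_RELAY 0x01

typedef struct {
    uint8_t flags;    /* e.g. MSG_FLAG_PROCESS_AT_RELAY */
    size_t  len;      /* payload length */
    void   *payload;
} relay_msg_t;

/* The "if you are told to process a message first, here is the
 * function you need to call" registration. */
typedef void (*relay_process_fn_t)(relay_msg_t *msg);
static relay_process_fn_t relay_cb = NULL;

void grpcomm_register_relay_cb(relay_process_fn_t fn)
{
    relay_cb = fn;
}

/* What the RML routing path would have to do before forwarding. */
void rml_route(relay_msg_t *msg)
{
    if ((msg->flags & MSG_FLAG_PROCESS_AT_RELAY) && relay_cb != NULL) {
        relay_cb(msg);            /* intermediary processes it first */
    }
    /* ... then compute the next hop and forward as usual ... */
}

/* Toy stand-in for the orted's relay processing. */
static void orted_process(relay_msg_t *msg)
{
    printf("processing %zu-byte payload before relaying\n", msg->len);
}

int main(void)
{
    relay_msg_t msg = { MSG_FLAG_PROCESS_AT_RELAY, 5, "hello" };
    grpcomm_register_relay_cb(orted_process);
    rml_route(&msg);
    return 0;
}

Note that the flag check would have to live in the RML routing path itself, touching all three frameworks at once - which should give a feel for why I call this extremely difficult.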
I had considered the alternative of calling the routed component to get the next recipient for the message (instead of computing it myself), which would at least remove the computation of the next recipients from the orted. I think that would be a more feasible next step, though it would take development of another routed component to support things like the binomial xcast algorithm, and possibly a change to the routed framework API, since algorithms like binomial might have to return more than one "next recipient". It also could get a little tricky, as the routed component might have to include logic to deal with some of the special use-cases currently handled in grpcomm.
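To see why a single return value won't do, here is a toy calculation of the binomial-tree "next recipients" - the function name and signature are invented, since the current routed API has nothing like this:

#include <stdio.h>

/* Hypothetical query: fill children[] with the binomial-tree children
 * of daemon 'me' in a job of 'num_daemons' daemons rooted at 0, and
 * return how many there are. */
static int binomial_next_recipients(int me, int num_daemons, int children[])
{
    int n = 0;
    for (int mask = 1; mask < num_daemons; mask <<= 1) {
        if (me & mask) {
            break;    /* stop at our lowest set bit; the remaining
                         ranks are covered by our ancestors */
        }
        int child = me | mask;
        if (child < num_daemons) {
            children[n++] = child;
        }
    }
    return n;
}

int main(void)
{
    int kids[32];
    int n = binomial_next_recipients(0, 8, kids);

    printf("daemon 0 relays to:");
    for (int i = 0; i < n; i++) {
        printf(" %d", kids[i]);    /* expect: 1 2 4 */
    }
    printf("\n");
    return 0;
}

In an 8-daemon job, daemon 0 relays to daemons 1, 2, and 4 in the first round, so any routed API supporting a binomial xcast has to be able to hand back a list of next hops.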
All of this is non-trivial, which is why nobody has tried to do it! If you want to tackle that area of the code, we would welcome the volunteer - all I ask is that you do it in a tmp branch somewhere first so we can test it.

Ralph


On 12/5/07 9:29 AM, "Brian W. Barrett" <brbar...@open-mpi.org> wrote:

> To me, (a) is dumb and (c) isn't a non-starter.
>
> The whole point of the component system is to separate concerns. Routing
> topology and collective operations are two different concerns. While
> there's some overlap (a topology-aware collective doesn't make sense when
> using the unity routing structure), it's not overlap in the sense that one
> implies you need the other. I can think of a couple of different ways of
> implementing the group communication framework, all of which are totally
> independent of the particulars of how routing is tracked.
>
> (b) has a very reasonable track record of working well on the OMPI side
> (the mpool / btl thing figures itself out fairly well). Bringing such a
> setup over to ORTE wouldn't be bad, but a bit hackish.
>
> Of course, there are at most two routed components built at any time, and
> the defaults are all most non-debugging people will ever need, so I guess
> I'm not convinced (c) is a non-starter.
>
> Brian
>
> On Wed, 5 Dec 2007, Tim Prins wrote:
>
>> To me, (c) is a non-starter. I think whenever possible we should be
>> automatically doing the right thing. The user should not need to have
>> any idea how things work inside the library.
>>
>> Between options (a) and (b), I don't really care.
>>
>> (b) would be great if we had an MCA component dependency system, which
>> has been much talked about. But without such a system it gets messy.
>>
>> (a) has the advantage of making sure there are no problems and allowing
>> the two systems to interact very nicely together, but it also might add
>> a large burden to a component writer.
>>
>> On a related, but slightly different topic, one thing that has always
>> bothered me about the grpcomm/routed implementation is that it is not
>> self-contained. There is logic for routing algorithms outside of the
>> components (for example, in orte/orted/orted_comm.c). So, if there are
>> any overhauls planned, I definitely think this needs to be cleaned up.
>>
>> Thanks,
>>
>> Tim
>>
>> Ralph H Castain wrote:
>>> II. Interaction between the ROUTED and GRPCOMM frameworks
>>>
>>> When we initially developed these two frameworks within the RTE, we
>>> envisioned them operating totally independently of each other. Thus,
>>> the grpcomm collectives provide algorithms such as a binomial "xcast"
>>> that uses the daemons to scalably send messages across the system.
>>>
>>> However, we recently realized that the efficacy of the current grpcomm
>>> algorithms hinges directly on the daemons being fully connected - which
>>> we were recently told may not be the case as other people introduce
>>> different ROUTED components. For example, using the binomial algorithm
>>> in grpcomm's xcast while having a ring topology selected in ROUTED
>>> would likely result in terrible performance.
>>>
>>> This raises the following questions:
>>>
>>> (a) should the GRPCOMM and ROUTED frameworks be consolidated to ensure
>>> that the group collective algorithms properly "match" the communication
>>> topology?
>>>
>>> (b) should we automatically select the grpcomm/routed pairings based on
>>> some internal logic?
>>>
>>> (c) should we leave this "as-is", with the user responsible for making
>>> intelligent choices (and for detecting when the performance is bad due
>>> to this mismatch)?
>>>
>>> (d) other suggestions?
>>>
>>> Ralph