Ralph,

Bringing back coll/sync is a cheap shot at hiding a real issue behind a smoke screen. As Nathan described in his email, Open MPI's lack of flow control on eager messages is the real culprit here, and the tight loop around a one-to-many collective (bcast and scatter*) only exacerbates the issue. However, a loop around small MPI_Send calls will also end in memory exhaustion, and that case cannot be easily circumvented by adding synchronizations deep inside the library.
George.

On Sat, Aug 20, 2016 at 12:30 AM, [email protected] <[email protected]> wrote:

> I can not provide the user report as it is a proprietary problem. However, it consists of a large loop of calls to MPI_Bcast that crashes due to unexpected messages. We have been looking at instituting flow control, but that has way too widespread an impact. The coll/sync component would be a simple solution.
>
> I honestly don’t believe the issue I was resolving was due to a bug - it was a simple problem of one proc running slow and creating an overload of unexpected messages that eventually consumed too much memory. Rather, I think you solved a different problem - by the time you arrived at LANL, the app I was working with had already modified their code to no longer create the problem (essentially refactoring the algorithm to avoid the massive loop over allreduce).
>
> I have no issue supporting it as it takes near-zero effort to maintain, and this is a fairly common problem with legacy codes that don’t want to refactor their algorithms.
>
> > On Aug 19, 2016, at 8:48 PM, Nathan Hjelm <[email protected]> wrote:
> >
> >> On Aug 19, 2016, at 4:24 PM, [email protected] wrote:
> >>
> >> Hi folks
> >>
> >> I had a question arise regarding a problem being seen by an OMPI user - has to do with the old bugaboo I originally dealt with back in my LANL days. The problem is with an app that repeatedly hammers on a collective, and gets overwhelmed by unexpected messages when one of the procs falls behind.
> >
> > I did some investigation on roadrunner several years ago and determined that the user code issue coll/sync was attempting to fix was due to a bug in ob1/cksum (really can’t remember). coll/sync was simply masking a live-lock problem. I committed a workaround for the bug in r26575 (https://github.com/open-mpi/ompi/commit/59e529cf1dfe986e40d14ec4d2a2e5ef0cea5e35) and tested it with the user code. After this change the user code ran fine without coll/sync. Since lanl no longer had any users of coll/sync we stopped supporting it.
> >
> >> I solved this back then by introducing the “sync” component in ompi/mca/coll, which injected a barrier operation every N collectives. You could even “tune” it by doing the injection for only specific collectives.
> >>
> >> However, I can no longer find that component in the code base - I find it in the 1.6 series, but someone removed it during the 1.7 series.
> >>
> >> Can someone tell me why this was done??? Is there any reason not to bring it back? It solves a very real, not uncommon, problem.
> >> Ralph
> >
> > This was discussed during one (or several) tel-cons years ago. We agreed to kill it and bring it back if there is 1) a use case, and 2) someone is willing to support it. See https://github.com/open-mpi/ompi/commit/5451ee46bd6fcdec002b333474dec919475d2d62 .
> >
> > Can you link the user email?
> >
> > -Nathan
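
To make the failure mode concrete, here is a minimal sketch (not taken from the thread) of the pattern being described: a large loop of small broadcasts with no flow control, where one deliberately slowed rank accumulates unexpected eager messages until memory runs out. The iteration count, payload size, and usleep() delay are illustrative assumptions, not values from the user code.

/* Hypothetical reproducer of the pattern discussed above: a tight loop of
 * small broadcasts with no flow control.  Small messages are sent eagerly,
 * so the root's MPI_Bcast can return long before slow ranks have matched
 * the data, and the laggard's unexpected-message queue grows without bound. */
#include <mpi.h>
#include <unistd.h>

#define ITERATIONS 1000000   /* "large loop of calls to MPI_Bcast"           */
#define COUNT      64        /* small payload -> stays under the eager limit */

int main(int argc, char **argv)
{
    int rank;
    double buf[COUNT] = {0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < ITERATIONS; i++) {
        if (rank == 1)
            usleep(100);   /* stand-in for "one proc running slow"           */

        /* The root hands each small broadcast to the transport and moves on;
         * nothing ever tells it that rank 1 is falling behind, so rank 1's
         * buffered unexpected messages keep consuming memory.               */
        MPI_Bcast(buf, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}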
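For comparison, this is roughly what the coll/sync component automated inside the library, per Ralph's description of injecting a barrier every N collectives, written out by hand at the application level. SYNC_INTERVAL = 1000 is an arbitrary choice for illustration. As George points out, this only helps the collective case; a loop of small point-to-point sends would still exhaust memory without real flow control.

/* The same loop, hand-throttled the way coll/sync would do it from inside
 * the library: a barrier every N collectives bounds how far any rank can
 * run ahead, and therefore how many unexpected eager messages a slow rank
 * must buffer.  N = 1000 is an arbitrary illustration, not a value from
 * the thread.                                                             */
#include <mpi.h>
#include <unistd.h>

#define ITERATIONS    1000000
#define COUNT         64
#define SYNC_INTERVAL 1000   /* illustrative "every N collectives" value */

int main(int argc, char **argv)
{
    int rank;
    double buf[COUNT] = {0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < ITERATIONS; i++) {
        if (rank == 1)
            usleep(100);   /* the slow process                           */

        MPI_Bcast(buf, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Periodic synchronization: no rank can be more than
         * SYNC_INTERVAL broadcasts ahead of the slowest one.            */
        if ((i + 1) % SYNC_INTERVAL == 0)
            MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The trade-off is the added latency of the periodic barrier, which is why, as described above, the component let you tune both how often the barrier was injected and which collectives it applied to.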
