Hi folks,

A question has come up regarding a problem an OMPI user is seeing. It is the old bugaboo I originally dealt with back in my LANL days: an app repeatedly hammers on a collective and gets overwhelmed by unexpected messages when one of the procs falls behind.
I solved this back then by introducing the “sync” component in ompi/mca/coll, which injected a barrier operation every N collectives so that a lagging proc could catch up before more messages piled up. You could even “tune” it by doing the injection for only specific collectives.

However, I can no longer find that component in the code base. It is present in the 1.6 series, but someone removed it during the 1.7 series. Can someone tell me why this was done? Is there any reason not to bring it back? It solves a very real, not uncommon, problem.

Ralph
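
P.S. For anyone who never saw the component, the idea of "a barrier every N collectives, optionally for only some collectives" can be sketched roughly as below. This is an illustrative model only, not the coll/sync source: the names `sync_throttle`, `sync_throttle_init`, `sync_throttle_tick`, `coll_kind`, and `counting_barrier` are all hypothetical, and the barrier is passed in as a callback so the counting logic can be read (and run) without MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical collective kinds, for illustration only. */
enum coll_kind { COLL_BCAST = 0, COLL_REDUCE = 1, COLL_ALLREDUCE = 2 };

/* Stand-in for the barrier; an MPI program would wrap MPI_Barrier(comm). */
typedef void (*barrier_fn)(void *ctx);

typedef struct {
    int interval;        /* inject a barrier every `interval` collectives; 0 disables */
    unsigned mask;       /* bit (1u << kind) set => that collective is counted */
    int count;           /* counted collectives since the last injected barrier */
    barrier_fn barrier;  /* called when the interval is reached */
    void *ctx;           /* opaque argument handed to `barrier` */
} sync_throttle;

static void sync_throttle_init(sync_throttle *t, int interval, unsigned mask,
                               barrier_fn barrier, void *ctx)
{
    assert(t != NULL && barrier != NULL && interval >= 0);
    t->interval = interval;
    t->mask = mask;
    t->count = 0;
    t->barrier = barrier;
    t->ctx = ctx;
}

/* Call once per collective. Returns 1 if a barrier was injected, else 0. */
static int sync_throttle_tick(sync_throttle *t, enum coll_kind kind)
{
    if (t->interval == 0 || (t->mask & (1u << kind)) == 0) {
        return 0;                 /* disabled, or this collective is not tuned in */
    }
    if (++t->count < t->interval) {
        return 0;                 /* interval not reached yet */
    }
    t->count = 0;                 /* reset and let lagging procs catch up */
    t->barrier(t->ctx);
    return 1;
}

/* Example callback that only counts how often a barrier would have run. */
static void counting_barrier(void *ctx)
{
    ++*(int *)ctx;
}
```

In an MPI program the callback would presumably just call `MPI_Barrier` on the communicator in question, and `sync_throttle_tick` would be invoked around each collective call. Whether the barrier goes before or after the collective is a design choice this sketch leaves open.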