Hi folks

A question has come up about a problem an OMPI user is seeing. It has to do
with the old bugaboo I originally dealt with back in my LANL days: an app
repeatedly hammers on a collective, and gets overwhelmed by unexpected
messages when one of the procs falls behind.

I solved this back then by introducing the “sync” component in ompi/mca/coll, 
which injected a barrier operation every N collectives. You could even “tune” 
it by doing the injection for only specific collectives.
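
For anyone who doesn't remember the component, here is a rough sketch of the
idea in plain C. It is not the actual coll/sync code: the struct, the function
names and the callback are all made up for illustration, and the callback
stands in for what would be a barrier on the communicator in a real MPI
program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state; none of these names come from the Open MPI sources. */
typedef struct {
    int interval;           /* inject a barrier every `interval` collectives (N) */
    int count;              /* collectives seen since the last barrier */
    int barriers_injected;  /* total number of barriers injected so far */
    void (*barrier)(void);  /* stand-in for a real barrier on the communicator */
} sync_state;

/* Call once before each collective. Returns 1 if a barrier was injected. */
static int sync_before_collective(sync_state *s)
{
    assert(s != NULL);
    if (s->interval <= 0) {
        return 0;               /* injection disabled */
    }
    if (++s->count < s->interval) {
        return 0;               /* not yet at the N-th collective */
    }
    s->count = 0;               /* restart the count after synchronizing */
    if (s->barrier != NULL) {
        s->barrier();           /* lets the slow procs catch up */
    }
    s->barriers_injected++;
    return 1;
}

/* Placeholder barrier so the sketch can run without MPI. */
static void noop_barrier(void) {}

/* Simulate `ncoll` back-to-back collectives and count injected barriers. */
static int simulate(int interval, int ncoll)
{
    sync_state s = { interval, 0, 0, noop_barrier };
    for (int i = 0; i < ncoll; i++) {
        sync_before_collective(&s);
        /* the collective itself (e.g. a broadcast) would run here */
    }
    return s.barriers_injected;
}
```

With an interval of 3, ten collectives in a row trigger a barrier before the
3rd, 6th and 9th. Restricting the injection to specific collectives would just
mean keeping one such counter per collective type.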

However, I can no longer find that component in the code base - I find it in 
the 1.6 series, but someone removed it during the 1.7 series.

Can someone tell me why this was done? Is there any reason not to bring it 
back? It solves a very real, not uncommon, problem.

Ralph
