I cannot provide the user report as the problem is proprietary. However, it 
boils down to a large loop of calls to MPI_Bcast that crashes because of a 
buildup of unexpected messages. We have been looking at instituting flow 
control, but that would have far too widespread an impact. The coll/sync 
component would be a simple solution.
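
To make the pattern concrete, it was roughly of this shape (a sketch only - the 
real code is proprietary, and the usleep on one rank is just there to mimic the 
straggler):

/* bcast_storm.c - illustrative sketch only; the real application is
 * proprietary. Rank 1's usleep() mimics the straggler: the fast ranks
 * can run ahead of the slow one, whose unexpected-message queue grows
 * until memory runs out. */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, i, buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 1000000; ++i) {
        if (1 == rank) {
            usleep(100);              /* the slow proc */
        }
        buf = i;
        MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

With small, eagerly-sent messages the fast ranks never block, so the backlog on 
the slow rank just keeps growing.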

I honestly don’t believe the issue I was resolving was due to a bug - it was a 
simple matter of one proc running slow and creating an overload of unexpected 
messages that eventually consumed too much memory. Rather, I think you solved a 
different problem - by the time you arrived at LANL, the developers of the app 
I was working with had already modified their code so it no longer triggered 
the problem (essentially refactoring the algorithm to avoid the massive loop 
over allreduce).

I have no issue supporting it as it takes near-zero effort to maintain, and 
this is a fairly common problem with legacy codes that don’t want to refactor 
their algorithms.
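
For anyone who wants to see what coll/sync buys you, the application-level 
equivalent is roughly the following (again just a sketch - SYNC_INTERVAL stands 
in for the "N" that coll/sync lets you tune without touching the application):

/* Illustrative sketch of the workaround that coll/sync automates inside
 * the library: inject a barrier every N collectives so no rank can fall
 * more than N operations behind. SYNC_INTERVAL is an arbitrary placeholder. */
#include <mpi.h>

#define SYNC_INTERVAL 1000

int main(int argc, char **argv)
{
    int i, buf = 0;

    MPI_Init(&argc, &argv);

    for (i = 0; i < 1000000; ++i) {
        MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (0 == (i + 1) % SYNC_INTERVAL) {
            MPI_Barrier(MPI_COMM_WORLD);   /* bounds the unexpected-message backlog */
        }
    }

    MPI_Finalize();
    return 0;
}

coll/sync does the same thing transparently inside the coll framework, which is 
exactly why it helps legacy codes that can’t or won’t touch their source.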


> On Aug 19, 2016, at 8:48 PM, Nathan Hjelm <hje...@me.com> wrote:
> 
>> On Aug 19, 2016, at 4:24 PM, r...@open-mpi.org wrote:
>> 
>> Hi folks
>> 
>> I had a question arise regarding a problem being seen by an OMPI user - it 
>> has to do with the old bugaboo I originally dealt with back in my LANL days. The 
>> problem is with an app that repeatedly hammers on a collective, and gets 
>> overwhelmed by unexpected messages when one of the procs falls behind.
> 
> I did some investigation on Roadrunner several years ago and determined that 
> the issue in the user code that coll/sync was attempting to fix was due to a 
> bug in ob1/cksum (I really can’t remember which). coll/sync was simply masking 
> a live-lock problem. I committed a workaround for the bug in r26575 
> (https://github.com/open-mpi/ompi/commit/59e529cf1dfe986e40d14ec4d2a2e5ef0cea5e35) 
> and tested it with the user code. After this change the user code ran fine 
> without coll/sync. Since LANL no longer had any users of coll/sync, we stopped 
> supporting it.
> 
>> I solved this back then by introducing the “sync” component in 
>> ompi/mca/coll, which injected a barrier operation every N collectives. You 
>> could even “tune” it by doing the injection for only specific collectives.
>> 
>> However, I can no longer find that component in the code base - I find it in 
>> the 1.6 series, but someone removed it during the 1.7 series.
>> 
>> Can someone tell me why this was done??? Is there any reason not to bring it 
>> back? It solves a very real, not uncommon, problem.
>> Ralph
> 
> This was discussed during one (or several) telecons years ago. We agreed to 
> kill it and bring it back if 1) there is a use case, and 2) someone is 
> willing to support it. See 
> https://github.com/open-mpi/ompi/commit/5451ee46bd6fcdec002b333474dec919475d2d62 .
> 
> Can you link the user email?
> 
> -Nathan
