Yes, this is a real issue for MOOSE, which sometimes has thousands of active 
single-field solvers.  PETSc can limit the number of fine-level communicators 
by retaining the dup'd communicator so that the same communicator is passed to 
hypre for each solver, but it cannot control the MPI_Comm_create for a parallel 
coarse level.  Hypre could handle that internally by attaching the coarse 
communicator as an attribute (on the relevant ranks) of the larger communicator.

The separate tag space is important because point-to-point messages can be 
pending when hypre is called.  That does not lead to deadlock by itself, but 
it is important that hypre not post sends or receives with those tags, lest 
messages be delivered incorrectly.

I feel like your response below is a false economy.  Nobody would fault hypre 
for dup'ing once.  But with the current interface, it is laborious (or, in 
the case of a parallel coarse solve, impossible) to create a thousand hypre 
solvers without also creating a thousand communicators.  Assuming you are not 
convinced, we will handle this in PETSc the same way PETSc handles it for 
itself, but (a) we still cannot control the communicator for a parallel 
coarse solve, and (b) this issue may crop up again if some other user 
attempts this sort of solve without using PETSc.

Rob Falgout hypre Tracker <[email protected]> writes:

> Rob Falgout <[email protected]> added the comment:
>
> Is somebody actually having a problem with communicator conflicts right now?
>
> I thought the reason for this thread was to reduce the number of 
> communicators because of limits in MPI implementations.  Somebody has to 
> reduce the Comm_create() and Comm_dup() calls.  We responded with one way to 
> reduce the create() calls in BoomerAMG, but now you are asking us to put them 
> back in by calling dup()?  I'm confused about what we are trying to achieve 
> here now.
>
> The reason I suggested that the user be responsible for calling dup() is 
> twofold: 1) I don't think it is common for users to run hypre in parallel 
> with other user code where both are using the same communicator (I'm not sure 
> how this could even work without deadlocking since hypre calls are 
> collective); 2) Making libraries lower down on the call stack be responsible 
> for calling dup() seems less scalable than the other way around and more 
> likely to increase the number of communicators used.
>
> Anyway, I'm still confused about what we are trying to achieve so maybe 
> somebody can try to summarize again?
>
> -Rob
>
> ____________________________________________
> hypre Issue Tracker <[email protected]>
> <http://cascb1.llnl.gov/hypre/issue1595>
> ____________________________________________
