I checked and found that MPI_Comm_dup() and MPI_Comm_free() were called in pairs, so the MPI runtime should not complain about running out of resources. My guess is that there are pending communications on the communicators, but I have no way to confirm that. Per the MPI manual, MPI_Comm_free() only marks a communicator object for deallocation; the object is actually released only once nothing else references it. We can file a bug report with OLCF. With the MPI source code, it should be easy for them to debug.
--Junchao Zhang

On Fri, Aug 20, 2021 at 4:14 PM Junchao Zhang <[email protected]> wrote:

> Feimi,
> I'm able to reproduce the problem. I will have a look. Thanks a lot for
> the example.
> --Junchao Zhang
>
>
> On Fri, Aug 20, 2021 at 2:02 PM Feimi Yu <[email protected]> wrote:
>
>> Sorry, I forgot to destroy the matrix after the loop, but anyway, the
>> in-loop preconditioners are destroyed. Updated the code here and the google
>> drive.
>>
>> Feimi
>> On 8/20/21 2:54 PM, Feimi Yu wrote:
>>
>> Hi Barry and Junchao,
>>
>> Actually I did a simple MPI "dup and free" test before with Spectrum MPI,
>> but that one did not have any problem. I'm not a PETSc programmer as I
>> mainly use deal.ii's PETSc wrappers, but I managed to write a minimal
>> program based on petsc/src/mat/tests/ex98.c to reproduce my problem. This
>> piece of code creates and destroys 10,000 instances of hypre ParaSails
>> preconditioners (for my own code, it uses Euclid, but I don't think it
>> matters). It runs fine with OpenMPI but reports the out of communicator
>> error with Spectrum MPI. The code is attached in the email. In case the
>> attachment is not available, I also uploaded a copy on my google drive:
>>
>> https://drive.google.com/drive/folders/1DCf7lNlks8GjazvoP7c211ojNHLwFKL6?usp=sharing
>>
>> Thanks!
>>
>> Feimi
>> On 8/20/21 9:58 AM, Junchao Zhang wrote:
>>
>> Feimi, if it is easy to reproduce, could you give instructions on how to
>> reproduce that?
>>
>> PS: Spectrum MPI is based on OpenMPI. I don't understand why it has the
>> problem but OpenMPI does not. It could be a bug in petsc or user's code.
>> For reference counting on MPI_Comm, we already have petsc inner comm. I
>> think we can reuse that.
>>
>> --Junchao Zhang
>>
>>
>> On Fri, Aug 20, 2021 at 12:33 AM Barry Smith <[email protected]> wrote:
>>
>>> It sounds like maybe the Spectrum MPI_Comm_free() is not returning the
>>> comm to the "pool" as available for future use; a very buggy MPI
>>> implementation. This can easily be checked in a tiny standalone MPI program
>>> that simply comm dups and frees thousands of times in a loop. Could even be
>>> a configure test (that requires running an MPI program). I do not remember
>>> if we ever tested this possibility; maybe we did and I forgot.
>>>
>>> If this is the problem we can provide a "work around" that attributes
>>> the new comm (to be passed to hypre) to the old comm with a reference count
>>> value also in the attribute. When the hypre matrix is created that count
>>> (with the new comm) is set to 1; when the hypre matrix is freed that count
>>> is set to zero (but the comm is not freed); in the next call to create the
>>> hypre matrix, when the attribute is found, the count is zero so PETSc knows
>>> it can pass the same comm again to the new hypre matrix.
>>>
>>> This will only allow one simultaneous hypre matrix to be created from
>>> the original comm. To allow multiple simultaneous hypre matrices one could
>>> have multiple comms and counts in the attribute and just check them until
>>> one finds an available one to reuse (or creates yet another one if all the
>>> current ones are busy with hypre matrices). So it is the same model as
>>> DMGetXXVector() where vectors are checked out and then checked in to be
>>> available later. This would solve the currently reported problem (if it is
>>> a buggy MPI that does not properly free comms), but not solve the MOOSE
>>> problem where 10,000 comms are needed at the same time.
>>>
>>> Barry
>>>
>>>
>>> On Aug 19, 2021, at 3:29 PM, Junchao Zhang <[email protected]> wrote:
>>>
>>> On Thu, Aug 19, 2021 at 2:08 PM Feimi Yu <[email protected]> wrote:
>>>
>>>> Hi Jed,
>>>>
>>>> In my case, I only have 2 hypre preconditioners at the same time, and
>>>> they do not solve simultaneously, so it might not be case 1.
>>>>
>>>> I checked the stack for all the calls of MPI_Comm_dup/MPI_Comm_free on
>>>> my own machine (with OpenMPI); all the communicators are freed from my
>>>> observation. I could not test it with Spectrum MPI on the clusters
>>>> immediately because all the dependencies were built in release mode.
>>>> However, as I mentioned, I haven't had this problem with OpenMPI before,
>>>> so I'm not sure if this is really an MPI implementation problem, or just
>>>> because Spectrum MPI has a lower limit on the number of communicators,
>>>> and/or this also depends on how many MPI ranks are used, as only 2 out
>>>> of 40 ranks reported the error.
>>>>
>>> You can add printf around MPI_Comm_dup/MPI_Comm_free sites on the two
>>> ranks, e.g., if (myrank == 38) printf(...), to see if the dup/free are
>>> paired.
>>>
>>>> As a workaround, I replaced the MPI_Comm_dup() at
>>>> petsc/src/mat/impls/hypre/mhypre.c:2120 with a copy assignment, and also
>>>> removed the MPI_Comm_free() in the hypre destroyer. My code runs fine
>>>> with Spectrum MPI now, but I don't think this is a long-term solution.
>>>>
>>>> Thanks!
>>>>
>>>> Feimi
>>>>
>>>> On 8/19/21 9:01 AM, Jed Brown wrote:
>>>> > Junchao Zhang <[email protected]> writes:
>>>> >
>>>> >> Hi, Feimi,
>>>> >> I need to consult Jed (cc'ed).
>>>> >> Jed, is this an example of
>>>> >> https://lists.mcs.anl.gov/mailman/htdig/petsc-dev/2018-April/thread.html#22663
>>>> >> ?
>>>> >> If Feimi really can not free matrices, then we just need to attach a
>>>> >> hypre-comm to a petsc inner comm, and pass that to hypre.
>>>> > Are there a bunch of solves as in that case?
>>>> >
>>>> > My understanding is that one should be able to
>>>> > MPI_Comm_dup/MPI_Comm_free as many times as you like, but the
>>>> > implementation has limits on how many communicators can co-exist at
>>>> > any one time. The many-at-once is what we encountered in that 2018
>>>> > thread.
>>>> >
>>>> > One way to check would be to use a debugger or tracer to examine the
>>>> > stack every time (P)MPI_Comm_dup and (P)MPI_Comm_free are called.
>>>> >
>>>> > case 1: we'll find lots of dups without frees (until the end) because
>>>> > the user really wants lots of these existing at the same time.
>>>> >
>>>> > case 2: dups are unfreed because of reference counting
>>>> > issue/inessential references
>>>> >
>>>> >
>>>> > In case 1, I think the solution is as outlined in the thread, PETSc
>>>> > can create an inner-comm for Hypre. I think I'd prefer to attach it to
>>>> > the outer comm instead of the PETSc inner comm, but perhaps a case
>>>> > could be made either way.
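
For reference, the check-out/check-in model Barry describes in the quoted message (reuse an idle duplicated comm, and dup a new one only when all existing ones are busy with hypre matrices) can be sketched in plain C. This is only an illustration under stated assumptions, not PETSc code: FakeComm, CommPool, POOL_MAX, pool_init(), pool_checkout(), and pool_checkin() are hypothetical names, and an int stands in for MPI_Comm so the sketch compiles without MPI. A real version would presumably keep the pool in an attribute on the outer comm and call MPI_Comm_dup() at the marked line.

```c
#include <assert.h>

#define POOL_MAX 8  /* hypothetical cap on comms kept per outer comm */

/* Stand-in for MPI_Comm so the sketch builds without an MPI installation. */
typedef int FakeComm;

typedef struct {
    FakeComm comm[POOL_MAX]; /* duplicated comms created so far */
    int      busy[POOL_MAX]; /* 1 while a hypre matrix holds the comm, else 0 */
    int      ncomms;         /* number of valid entries in comm[] and busy[] */
    int      ndups;          /* how many times a new comm had to be created */
} CommPool;

static void pool_init(CommPool *p)
{
    p->ncomms = 0;
    p->ndups  = 0;
}

/* Check out a comm: reuse an idle one if possible, otherwise create one.
   Returns the slot index, or -1 if the pool is full and every comm is busy. */
static int pool_checkout(CommPool *p)
{
    for (int i = 0; i < p->ncomms; i++) {
        if (!p->busy[i]) {
            p->busy[i] = 1;
            return i;
        }
    }
    if (p->ncomms == POOL_MAX) return -1;

    int slot = p->ncomms++;
    p->comm[slot] = 1000 + p->ndups++; /* real code would call MPI_Comm_dup() here */
    p->busy[slot] = 1;
    return slot;
}

/* Check a comm back in: mark it idle but keep it (no MPI_Comm_free()). */
static void pool_checkin(CommPool *p, int slot)
{
    assert(slot >= 0 && slot < p->ncomms && p->busy[slot]);
    p->busy[slot] = 0;
}
```

With this model, creating and destroying one matrix 10,000 times in a loop costs a single dup, because every checkout after the first finds slot 0 idle; two matrices alive at once cost two dups. As Barry notes, it does not help when thousands of comms are needed at the same time; in this sketch that case shows up as pool_checkout() returning -1.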
