On Mon, Jun 3, 2019 at 6:56 PM Zhang, Junchao via petsc-users <petsc-users@mcs.anl.gov> wrote:
> On Mon, Jun 3, 2019 at 5:23 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>
>> On Jun 4, 2019, at 1:17 AM, Zhang, Junchao via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>
>> Sanjay & Barry,
>>   Sorry, I made a mistake when I said I could reproduce Sanjay's experiments. I found that 1) to correctly use PetscMallocGetCurrentUsage() when PETSc is configured without debugging, I have to add -malloc when running the program, and 2) I have to instrument the code outside of KSPSolve(); in my case, that is in SNESSolve_NEWTONLS. In the old experiments I did it inside KSPSolve. Since KSPSolve can recursively call KSPSolve, the old results were misleading.
>>   With these fixes, I measured the differences in RSS and PETSc malloc before/after KSPSolve. I ran experiments on a MacBook using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 -malloc.
>>   I find that if the grid size is small, I can see a non-zero RSS delta randomly, either with one MPI rank or with multiple ranks, with MPICH or OpenMPI. If I increase the grid size, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero RSS deltas randomly in the first few iterations (with MPICH or OpenMPI). When the computer's workload is high, e.g., when running ex5-openmpi and ex5-mpich simultaneously, the MPICH run shows many more non-zero RSS deltas. But the "Malloc Delta" behavior is stable across all runs: there is only one non-zero malloc delta value, in the first KSPSolve call, and all the remaining ones are zero. Something like this:
>>
>>   mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler -ts_max_steps 500 -malloc
>>   RSS Delta=  32489472, Malloc Delta=  26290304, RSS End= 136114176
>>   RSS Delta=     32768, Malloc Delta=         0, RSS End= 138510336
>>   RSS Delta=         0, Malloc Delta=         0, RSS End= 138522624
>>   RSS Delta=         0, Malloc Delta=         0, RSS End= 138539008
>>
>>   So I think I can conclude there is no unfreed memory in KSPSolve() allocated by PETSc. Has MPICH allocated unfreed memory in KSPSolve? That is possible, and I am trying to find a way, like PetscMallocGetCurrentUsage(), to measure that. Also, I think the RSS delta is not a good way to measure memory allocation. It is dynamic and depends on the state of the computer (swap, shared libraries loaded, etc.) when running the code. We should focus on mallocs instead. If there were a valgrind tool, like the performance profiling tools, that could let users measure memory allocated but not freed in a user-specified code segment, that would be very helpful in this case. But I have not found one.
>>
>>   Junchao
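For reference, a minimal sketch of this kind of before/after measurement, written as a hypothetical user-level wrapper around KSPSolve (the name KSPSolveWithMemoryReport and the routine itself are illustrative, not the exact instrumentation used in the experiments above; the reported numbers are per rank, and -malloc is needed for PetscMallocGetCurrentUsage() in an optimized build):

    #include <petscksp.h>

    /* Sketch: report RSS and PETSc-malloc deltas across one KSPSolve call.
       Numbers are per rank; summing over ranks (e.g. with MPIU_Allreduce)
       is omitted for brevity. */
    static PetscErrorCode KSPSolveWithMemoryReport(KSP ksp, Vec b, Vec x)
    {
      PetscErrorCode ierr;
      PetscLogDouble rss0, rss1, mal0, mal1;

      PetscFunctionBeginUser;
      ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);
      ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
      ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
      ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD,
                         "RSS Delta= %.0f, Malloc Delta= %.0f, RSS End= %.0f\n",
                         rss1 - rss0, mal1 - mal0, rss1);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }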
>>
>> Have you ever tried Massif? http://valgrind.org/docs/manual/ms-manual.html
>
> No. I came across it but am not familiar with it. I did not find APIs to call to get the current memory usage. Will look at it further. Thanks.

This is definitely the correct tool. It intercepts all calls to malloc()/free(), so it can give you the complete picture of allocated memory at any time. It will draw a line graph of this, labeled by the routine that does each allocation.

   Matt
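For reference (these commands are not from the thread, just typical Massif usage), running the example above under Massif would look something like the following, producing one output file per rank that can then be inspected with ms_print:

    mpirun -n 4 valgrind --tool=massif ./ex5 -da_grid_x 256 -da_grid_y 256 -ts_type beuler -ts_max_steps 500
    ms_print massif.out.<pid>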
>> Sanjay, did you say that currently you can run with OpenMPI without running out of memory, but with MPICH you run out of memory? Is it feasible to share your code so that I can test with it? Thanks.
>>
>> --Junchao Zhang
>>
>> On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee <s...@berkeley.edu> wrote:
>>
>>> Barry,
>>>   If you look at the graphs I generated (on my Mac), you will see that OpenMPI and MPICH have very different values (along with the fact that MPICH does not seem to adhere to the standard for releasing MPI_Isend resources following an MPI_Wait).
>>>
>>> -sanjay
>>>
>>> PS: I agree with Barry's assessment; this is really not that acceptable.
>>>
>>> On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>>> > Junchao,
>>> >    This is insane. Either the OpenMPI library or something in the OS underneath, related to sockets and interprocess communication, is grabbing additional space for each round of MPI communication! Does MPICH have the same values as OpenMPI or different values? When you run on Linux, do you get the same values as on Apple or different ones? Same values would seem to indicate the issue is inside OpenMPI/MPICH; different values would indicate the problem is more likely at the OS level. Does this happen only with the default VecScatter that uses blocking MPI? What happens with PetscSF under Vec? Is it somehow related to PETSc's use of nonblocking sends and receives? One could presumably use valgrind to see exactly what lines in what code are causing these increases. I don't think we can just shrug and say this is the way it is; we need to track down and understand the cause (and if possible fix it).
>>> >
>>> >   Barry
>>> >
>>> >> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>> >>
>>> >> Sanjay,
>>> >>   I tried PETSc with MPICH and OpenMPI on my MacBook. I inserted PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and end of KSPSolve and then computed the delta and summed over processes. Then I tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>>> >>   With OpenMPI,
>>> >>   mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 500 > 128.log
>>> >>   grep -n -v "RSS Delta=         0, Malloc Delta=         0" 128.log
>>> >>   1:RSS Delta=     69632, Malloc Delta=         0
>>> >>   2:RSS Delta=     69632, Malloc Delta=         0
>>> >>   3:RSS Delta=     69632, Malloc Delta=         0
>>> >>   4:RSS Delta=     69632, Malloc Delta=         0
>>> >>   9:RSS Delta=9.25286e+06, Malloc Delta=         0
>>> >>   22:RSS Delta=     49152, Malloc Delta=         0
>>> >>   44:RSS Delta=     20480, Malloc Delta=         0
>>> >>   53:RSS Delta=     49152, Malloc Delta=         0
>>> >>   66:RSS Delta=      4096, Malloc Delta=         0
>>> >>   97:RSS Delta=     16384, Malloc Delta=         0
>>> >>   119:RSS Delta=    20480, Malloc Delta=         0
>>> >>   141:RSS Delta=    53248, Malloc Delta=         0
>>> >>   176:RSS Delta=    16384, Malloc Delta=         0
>>> >>   308:RSS Delta=    16384, Malloc Delta=         0
>>> >>   352:RSS Delta=    16384, Malloc Delta=         0
>>> >>   550:RSS Delta=    16384, Malloc Delta=         0
>>> >>   572:RSS Delta=    16384, Malloc Delta=         0
>>> >>   669:RSS Delta=    40960, Malloc Delta=         0
>>> >>   924:RSS Delta=    32768, Malloc Delta=         0
>>> >>   1694:RSS Delta=   20480, Malloc Delta=         0
>>> >>   2099:RSS Delta=   16384, Malloc Delta=         0
>>> >>   2244:RSS Delta=   20480, Malloc Delta=         0
>>> >>   3001:RSS Delta=   16384, Malloc Delta=         0
>>> >>   5883:RSS Delta=   16384, Malloc Delta=         0
>>> >>
>>> >>   If I increase the grid,
>>> >>   mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 -malloc_test > 512.log
>>> >>   grep -n -v "RSS Delta=         0, Malloc Delta=         0" 512.log
>>> >>   1:RSS Delta=1.05267e+06, Malloc Delta=         0
>>> >>   2:RSS Delta=1.05267e+06, Malloc Delta=         0
>>> >>   3:RSS Delta=1.05267e+06, Malloc Delta=         0
>>> >>   4:RSS Delta=1.05267e+06, Malloc Delta=         0
>>> >>   13:RSS Delta=1.24932e+08, Malloc Delta=         0
>>> >>
>>> >>   So we did see RSS increases in 4K-page-sized chunks after KSPSolve. As long as there are no memory leaks, why do you care about it? Is it because you run out of memory?
>>> >>
>>> >> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>> >>
>>> >>   Thanks for the update. So the current conclusions are that using the Waitall in your code
>>> >>
>>> >>   1) solves the memory issue with OpenMPI in your code
>>> >>
>>> >>   2) does not solve the memory issue with PETSc KSPSolve
>>> >>
>>> >>   3) MPICH has memory issues both for your code and PETSc KSPSolve, despite the Waitall fix?
>>> >>
>>> >>   If you literally just comment out the call to KSPSolve() with OpenMPI, is there no growth in memory usage?
>>> >>
>>> >>   Both 2 and 3 are concerning; they indicate possible memory leak bugs in MPICH and/or that not all MPI resources are freed in KSPSolve().
>>> >>
>>> >>   Junchao, can you please investigate 2 and 3 with, for example, a TS example that uses the linear solver (like with -ts_type beuler)? Thanks
>>> >>
>>> >>   Barry
>>> >>
>>> >>> On May 30, 2019, at 1:47 PM, Sanjay Govindjee <s...@berkeley.edu> wrote:
>>> >>>
>>> >>> Lawrence,
>>> >>>   Thanks for taking a look! This is what I had been wondering about -- my knowledge of MPI is pretty minimal, and the origins of the routine go back to a programmer we hired a decade+ ago from NERSC. I'll have to look into VecScatter. It will be great to dispense with our roll-your-own routines (we even have our own reduceALL scattered around the code).
>>> >>>   Interestingly, MPI_Waitall has solved the problem when using OpenMPI, but it still persists with MPICH. Graphs attached.
>>> >>>   I'm going to run with OpenMPI for now (but I guess I really still need to figure out what is wrong with MPICH and Waitall; I'll try Barry's suggestion of --download-mpich-configure-arguments="--enable-error-messages=all --enable-g" later today and report back).
>>> >>>   Regarding MPI_Barrier, it was put in due to a problem where some processes were finishing their sends and receives and exiting the subroutine before the receiving processes had completed (which resulted in data loss, as the buffers are freed after the call to the routine). MPI_Barrier was the solution proposed to us. I don't think I can dispense with it, but I will think about it some more.
>>> >>>   I'm not so sure about using MPI_Irecv, as it will require a bit of rewriting since right now I process the received data sequentially after each blocking MPI_Recv -- clearly slower but easier to code.
>>> >>>   Thanks again for the help.
>>> >>>
>>> >>>   -sanjay
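For reference, a minimal self-contained sketch (in C, with hypothetical routine and argument names; Sanjay's code is Fortran, but the pattern is the same) of the Isend/Irecv/Waitall exchange that Lawrence describes in the quoted message below:

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical neighbour exchange: post all sends and receives, then wait
       on every request so the MPI library can free its internal resources.
       No barrier is needed; the Waitall provides neighbourwise completion. */
    void exchange(MPI_Comm comm, int nsend, const int *sdest, double **sbuf,
                  int nrecv, const int *rsrc, double **rbuf, int count)
    {
      MPI_Request *req = malloc((size_t)(nsend + nrecv) * sizeof(MPI_Request));
      int i, j;

      for (i = 0; i < nsend; i++)
        MPI_Isend(sbuf[i], count, MPI_DOUBLE, sdest[i], 0, comm, &req[i]);
      for (j = 0; j < nrecv; j++)
        MPI_Irecv(rbuf[j], count, MPI_DOUBLE, rsrc[j], 0, comm, &req[nsend + j]);
      MPI_Waitall(nsend + nrecv, req, MPI_STATUSES_IGNORE);
      free(req);
    }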
>>> >>> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>>> >>>> Hi Sanjay,
>>> >>>>
>>> >>>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>> >>>>>
>>> >>>>> The problem seems to persist but with a different signature. Graphs attached as before.
>>> >>>>>
>>> >>>>> Totals with MPICH (NB: single run)
>>> >>>>>
>>> >>>>> For the CG/Jacobi      data_exchange_total = 41,385,984; kspsolve_total = 38,289,408
>>> >>>>> For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total = 41,324,544
>>> >>>>>
>>> >>>>> Just reading the MPI docs, I am wondering if I need some sort of MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine? I would have thought that with the blocking receives and the MPI_Barrier everything would have fully completed and been cleaned up before all processes exited the routine, but perhaps I am wrong on that.
>>> >>>> Skimming the Fortran code you sent, you do:
>>> >>>>
>>> >>>> for i in ...:
>>> >>>>    call MPI_Isend(..., req, ierr)
>>> >>>>
>>> >>>> for i in ...:
>>> >>>>    call MPI_Recv(..., ierr)
>>> >>>>
>>> >>>> But you never call MPI_Wait on the request you got back from the Isend. So the MPI library will never free the data structures it created.
>>> >>>>
>>> >>>> The usual pattern for these non-blocking communications is to allocate an array for the requests of length nsend+nrecv and then do:
>>> >>>>
>>> >>>> for i in nsend:
>>> >>>>    call MPI_Isend(..., req[i], ierr)
>>> >>>> for j in nrecv:
>>> >>>>    call MPI_Irecv(..., req[nsend+j], ierr)
>>> >>>>
>>> >>>> call MPI_Waitall(req, ..., ierr)
>>> >>>>
>>> >>>> I note also that there's no need for the Barrier at the end of the routine; this kind of communication does neighbourwise synchronisation, so there is no need to add (unnecessary) global synchronisation too.
>>> >>>>
>>> >>>> As an aside, is there a reason you don't use PETSc's VecScatter to manage this global-to-local exchange?
>>> >>>>
>>> >>>> Cheers,
>>> >>>>
>>> >>>> Lawrence
>>> >>> <cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png>

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
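On Lawrence's VecScatter suggestion above: a minimal sketch of the kind of global-to-local gather that VecScatter manages (the routine GatherNeeded and the vectors gvec/lvec and index set are hypothetical, not taken from Sanjay's code):

    #include <petscvec.h>

    /* Sketch: gather the globally numbered entries listed in 'needed' from a
       parallel vector 'gvec' into a local sequential vector 'lvec'.  The
       VecScatter replaces a hand-rolled Isend/Irecv/Waitall exchange and
       releases all of its MPI resources in VecScatterDestroy(). */
    PetscErrorCode GatherNeeded(Vec gvec, IS needed, Vec lvec)
    {
      VecScatter     ctx;
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = VecScatterCreate(gvec, needed, lvec, NULL, &ctx);CHKERRQ(ierr);
      ierr = VecScatterBegin(ctx, gvec, lvec, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = VecScatterEnd(ctx, gvec, lvec, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = VecScatterDestroy(&ctx);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }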