On Tue, Dec 19, 2023 at 5:11 AM Joauma Marichal <joauma.maric...@uclouvain.be> wrote:
> Hello,
>
> I have used Address Sanitizer to check for memory errors. On my computer,
> no errors are found. Unfortunately, on the supercomputer that I am using, I
> get lots of errors… I attach my log files (running on 1 and 70 procs).
>
> Do you have any idea of what I could do?
>

Run the same parallel configuration on your computer as you do on the supercomputer. If that is fine, I would suggest Address Sanitizer there. Something is corrupting the stack, and it appears that it is connected to that machine, rather than the library. Do you have access to a second parallel machine?

  Thanks,

     Matt

> Thanks a lot for your help.
>
> Best regards,
>
> Joauma
>
> From: Matthew Knepley <knep...@gmail.com>
> Date: Monday, 18 December 2023 at 12:00
> To: Joauma Marichal <joauma.maric...@uclouvain.be>
> Cc: petsc-ma...@mcs.anl.gov <petsc-ma...@mcs.anl.gov>,
> petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> Subject: Re: [petsc-maint] DMSwarm on multiple processors
>
> On Mon, Dec 18, 2023 at 5:09 AM Joauma Marichal
> <joauma.maric...@uclouvain.be> wrote:
>
> Hello,
>
> Sorry for the delay. I attach the file that I obtain when running the code
> in debug mode.
>
> Okay, we can now see where this is happening:
>
> malloc_consolidate(): invalid chunk size
> [cns263:3265170] *** Process received signal ***
> [cns263:3265170] Signal: Aborted (6)
> [cns263:3265170] Signal code: (-6)
> [cns263:3265170] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f3bd9148b20]
> [cns263:3265170] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f3bd9148a9f]
> [cns263:3265170] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f3bd911be05]
> [cns263:3265170] [ 3] /lib64/libc.so.6(+0x91037)[0x7f3bd918b037]
> [cns263:3265170] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f3bd919219c]
> [cns263:3265170] [ 5] /lib64/libc.so.6(+0x98b68)[0x7f3bd9192b68]
> [cns263:3265170] [ 6] /lib64/libc.so.6(+0x9af18)[0x7f3bd9194f18]
> [cns263:3265170] [ 7] /lib64/libc.so.6(__libc_malloc+0x1e2)[0x7f3bd9196822]
> [cns263:3265170] [ 8] /lib64/libc.so.6(posix_memalign+0x3c)[0x7f3bd91980fc]
> [cns263:3265170] [ 9] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocAlign+0x45)[0x7f3bda5f1625]
> [cns263:3265170] [10] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocA+0x297)[0x7f3bda5f1b07]
> [cns263:3265170] [11] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMCreate+0x5b)[0x7f3bdaa73c1b]
> [cns263:3265170] [12] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate+0x9)[0x7f3bdab0a2f9]
> [cns263:3265170] [13] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate3d+0x9a)[0x7f3bdab07dea]
> [cns263:3265170] [14] ./cobpor[0x402de8]
> [cns263:3265170] [15] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f3bd9134cf3]
> [cns263:3265170] [16] ./cobpor[0x40304e]
> [cns263:3265170] *** End of error message ***
>
> However, this is not great. First, the amount of memory being allocated is
> quite small, and this does not appear to be an Out of Memory error.
> Second, the error occurs in libc:
>
> malloc_consolidate(): invalid chunk size
>
> which means something is wrong internally. I agree with this analysis
> (https://stackoverflow.com/questions/18760999/sample-example-program-to-get-the-malloc-consolidate-error)
> that says you have probably overwritten memory somewhere in your code. I
> recommend running under valgrind, or using Address Sanitizer from clang.
>
> Thanks,
>
> Matt
>
> Thanks for your help.
>
> Best regards,
>
> Joauma
>
> From: Matthew Knepley <knep...@gmail.com>
> Date: Thursday, 23 November 2023 at 15:32
> To: Joauma Marichal <joauma.maric...@uclouvain.be>
> Cc: petsc-ma...@mcs.anl.gov <petsc-ma...@mcs.anl.gov>,
> petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> Subject: Re: [petsc-maint] DMSwarm on multiple processors
>
> On Thu, Nov 23, 2023 at 9:01 AM Joauma Marichal
> <joauma.maric...@uclouvain.be> wrote:
>
> Hello,
>
> My problem persists… Is there anything I could try?
>
> Yes. It appears to be failing from a call inside PetscSFSetUpRanks(). It
> does allocation, and the failure is in libc, and it only happens on larger
> examples, so I suspect some allocation problem. Can you rebuild with
> debugging and run this example? Then we can see if the allocation fails.
>
> Thanks,
>
> Matt
>
> Thanks a lot.
>
> Best regards,
>
> Joauma
>
> From: Matthew Knepley <knep...@gmail.com>
> Date: Wednesday, 25 October 2023 at 14:45
> To: Joauma Marichal <joauma.maric...@uclouvain.be>
> Cc: petsc-ma...@mcs.anl.gov <petsc-ma...@mcs.anl.gov>,
> petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> Subject: Re: [petsc-maint] DMSwarm on multiple processors
>
> On Wed, Oct 25, 2023 at 8:32 AM Joauma Marichal via petsc-maint
> <petsc-ma...@mcs.anl.gov> wrote:
>
> Hello,
>
> I am using the DMSwarm library in an Eulerian-Lagrangian approach to
> simulate vapor bubbles in water.
>
> I have obtained nice results recently and wanted to perform bigger
> simulations. Unfortunately, when I increase the number of processors used
> to run the simulation, I get the following error:
>
> free(): invalid size
> [cns136:590327] *** Process received signal ***
> [cns136:590327] Signal: Aborted (6)
> [cns136:590327] Signal code: (-6)
> [cns136:590327] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f56cd4c9b20]
> [cns136:590327] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f56cd4c9a9f]
> [cns136:590327] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f56cd49ce05]
> [cns136:590327] [ 3] /lib64/libc.so.6(+0x91037)[0x7f56cd50c037]
> [cns136:590327] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f56cd51319c]
> [cns136:590327] [ 5] /lib64/libc.so.6(+0x99aac)[0x7f56cd514aac]
> [cns136:590327] [ 6] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUpRanks+0x4c4)[0x7f56cea71e64]
> [cns136:590327] [ 7] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(+0x841642)[0x7f56cea83642]
> [cns136:590327] [ 8] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUp+0x9e)[0x7f56cea7043e]
> [cns136:590327] [ 9] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(VecScatterCreate+0x164e)[0x7f56cea7bbde]
> [cns136:590327] [10] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA_3D+0x3e38)[0x7f56cee84dd8]
> [cns136:590327] [11] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA+0xd8)[0x7f56cee9b448]
> [cns136:590327] [12] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp+0x20)[0x7f56cededa20]
> [cns136:590327] [13] ./cobpor[0x4418dc]
> [cns136:590327] [14] ./cobpor[0x408b63]
> [cns136:590327] [15] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f56cd4b5cf3]
> [cns136:590327] [16] ./cobpor[0x40bdee]
> [cns136:590327] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 84 with PID 590327 on node cns136 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> When I reduce the number of processors the error disappears and when I run
> my code without the vapor bubbles it also works.
>
> The problem seems to take place at this moment:
>
> DMCreate(PETSC_COMM_WORLD,swarm);
> DMSetType(*swarm,DMSWARM);
> DMSetDimension(*swarm,3);
> DMSwarmSetType(*swarm,DMSWARM_PIC);
> DMSwarmSetCellDM(*swarm,*dmcell);
>
> Thanks a lot for your help.
>
> Things that would help us track this down:
>
> 1) The smallest example where it fails
>
> 2) The smallest number of processes where it fails
>
> 3) A stack trace of the failure
>
> 4) A simple example that we can run that also fails
>
> Thanks,
>
> Matt
>
> Best regards,
>
> Joauma
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
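
A note on the diagnosis above: both aborts ("free(): invalid size" and "malloc_consolidate(): invalid chunk size") are raised inside libc, in an unrelated later allocation or free, which is the usual symptom when an earlier write ran past the end of a heap buffer. The sketch below is only an illustration of that idea, not code from this thread, from PETSc, or from libc. The names guarded_malloc, guarded_check and guarded_free are hypothetical helpers invented for this example, and the guard width and fill pattern are arbitrary choices. Each allocation gets a band of known bytes on either side, so an out-of-bounds write can be found by an explicit check rather than by a crash somewhere else. As far as I know, PETSc's -malloc_debug option and its CHKMEMQ macro apply a related idea, and valgrind or Address Sanitizer (both recommended above) do the same job far more thoroughly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define HEADER_BYTES 16   /* stores the requested size; 16 keeps user data aligned */
#define GUARD_BYTES  16   /* width of the guard band on each side (arbitrary) */
#define GUARD_FILL   0xAB /* arbitrary byte pattern written into the guard bands */

/* Layout of one block: [header][front guard][user data][back guard] */

/* Hypothetical helper: allocate n usable bytes surrounded by guard bands. */
static void *guarded_malloc(size_t n)
{
    unsigned char *raw = (unsigned char *)malloc(HEADER_BYTES + 2 * GUARD_BYTES + n);
    if (raw == NULL) return NULL;
    memcpy(raw, &n, sizeof n);                                   /* remember the size */
    memset(raw + HEADER_BYTES, GUARD_FILL, GUARD_BYTES);         /* front guard */
    memset(raw + HEADER_BYTES + GUARD_BYTES + n, GUARD_FILL, GUARD_BYTES); /* back guard */
    return raw + HEADER_BYTES + GUARD_BYTES;
}

/* Hypothetical helper: returns 0 if both guards are intact,
 * bit 0 set if the front guard was overwritten,
 * bit 1 set if the back guard was overwritten. */
static int guarded_check(const void *user)
{
    const unsigned char *p = (const unsigned char *)user;
    const unsigned char *front, *back;
    size_t n, i;
    int damaged = 0;

    assert(user != NULL);
    memcpy(&n, p - GUARD_BYTES - HEADER_BYTES, sizeof n);
    front = p - GUARD_BYTES;
    back  = p + n;
    for (i = 0; i < GUARD_BYTES; i++) {
        if (front[i] != GUARD_FILL) damaged |= 1;
        if (back[i]  != GUARD_FILL) damaged |= 2;
    }
    return damaged;
}

/* Hypothetical helper: release a block obtained from guarded_malloc. */
static void guarded_free(void *user)
{
    if (user != NULL) free((unsigned char *)user - GUARD_BYTES - HEADER_BYTES);
}
```

For example, after allocating room for four doubles with guarded_malloc(4 * sizeof(double)), writing elements 0 to 3 leaves guarded_check at 0, while an off-by-one write to element 4 lands in the back guard and makes guarded_check return 2. Calling such a check at a few points in a program narrows down which phase did the damage, which is the information the libc abort alone does not give.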