Hi Richard,

I tested the case you sent over and confirmed that it fails because the number of non-zeros overflows a 32-bit integer. With a PETSc built with 64-bit indices, it passes.

There was a typo in the option name when you reported that --with-64-bit-indicies=yes failed. It should be --with-64-bit-indices=yes.

You have two options: use a PETSc built with 64-bit indices, or run in parallel with multiple MPI ranks so that each rank holds fewer non-zeros, which is also faster (but you need to make sure the code is correctly parallelized).

Barry's recent fix, ierr = PetscIntCast(nz64,&nz);CHKERRQ(ierr);, would print a more useful error message in this case. Barry, should we backport it to 3.6.3?
--Junchao Zhang

On Sun, Feb 16, 2020 at 11:37 PM Junchao Zhang <jczh...@mcs.anl.gov> wrote:

> Richard,
> I managed to get the code Simlul@trophy built. Could you tell me how to
> run your test? I want to see if I can reproduce the error. Thanks
>
> --Junchao Zhang
>
>
> On Fri, Feb 14, 2020 at 8:34 PM Richard Beare <richard.be...@monash.edu> wrote:
>
>> It doesn't compile out of the box with master.
>>
>> singularity def file attached.
>>
>> On Sat, 15 Feb 2020 at 08:03, Richard Beare <richard.be...@monash.edu> wrote:
>>
>>> I will see if I can build with master. The docs for simulatrophy say
>>> 3.6.3.1.
>>>
>>> On Sat, 15 Feb 2020 at 02:47, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
>>>
>>>> Which petsc version do you use? In aij.c of the master branch, I saw
>>>> Barry recently added a useful check to catch number of nonzero overflow,
>>>> ierr = PetscIntCast(nz64,&nz);CHKERRQ(ierr); But you mentioned using
>>>> 64-bit indices did not solve the problem, it might not be the reason. You
>>>> should try the master branch if feasible. Also, vary number of MPI ranks to
>>>> see if error stack changes.
>>>>
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Fri, Feb 14, 2020 at 5:12 AM Richard Beare via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>
>>>>> No luck - exactly the same error after including the
>>>>> --with-64-bit-indicies=yes --download-mpich=yes options
>>>>>
>>>>> ==8674== Argument 'size' of function memalign has a fishy (possibly negative) value: -17152036540
>>>>> ==8674==    at 0x4C320A6: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
>>>>> ==8674==    by 0x4F0CFF2: PetscMallocAlign(unsigned long, int, char const*, char const*, void**) (mal.c:28)
>>>>> ==8674==    by 0x4F0F716: PetscTrMallocDefault(unsigned long, int, char const*, char const*, void**) (mtr.c:188)
>>>>> ==8674==    by 0x569AF3E: MatSeqAIJSetPreallocation_SeqAIJ (aij.c:3595)
>>>>> ==8674==    by 0x569A531: MatSeqAIJSetPreallocation (aij.c:3539)
>>>>> ==8674==    by 0x599080A: DMCreateMatrix_DA_3d_MPIAIJ(_p_DM*, _p_Mat*) (fdda.c:1085)
>>>>> ==8674==    by 0x598B937: DMCreateMatrix_DA(_p_DM*, _p_Mat**) (fdda.c:759)
>>>>> ==8674==    by 0x58A2BF2: DMCreateMatrix (dm.c:956)
>>>>> ==8674==    by 0x5E377B3: KSPSetUp (itfunc.c:262)
>>>>> ==8674==    by 0x409FFC: PetscAdLemTaras3D::solveModel(bool) (PetscAdLemTaras3D.hxx:255)
>>>>> ==8674==    by 0x4239FB: AdLem3D<3u>::solveModel(bool, bool, bool) (AdLem3D.hxx:551)
>>>>> ==8674==    by 0x41BD17: main (PetscAdLemMain.cxx:344)
>>>>> ==8674==
>>>>> On Fri, 14 Feb 2020 at 17:07, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>>>
>>>>>>
>>>>>> Richard,
>>>>>>
>>>>>> It is likely that for these problems some of the integers become
>>>>>> too large for the int variable to hold them, thus they overflow and
>>>>>> become negative.
>>>>>>
>>>>>> You should make a new PETSC_ARCH configuration of PETSc that
>>>>>> uses the configure option --with-64-bit-indices, this will change PETSc
>>>>>> to use 64 bit integers which will not overflow.
>>>>>>
>>>>>> Good luck and let us know how it works out
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>> Probably the code is built with an older version of PETSc; the
>>>>>> later versions should produce a more useful error message.
>>>>>>
>>>>>> > On Feb 13, 2020, at 11:43 PM, Richard Beare via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>> >
>>>>>> > Hi Everyone,
>>>>>> > I am experimenting with the Simlul@trophy tool (
>>>>>> > https://github.com/Inria-Asclepios/simul-atrophy) that uses petsc to
>>>>>> > simulate brain atrophy based on segmented MRI data. I am not the author. I
>>>>>> > have this running on most of a dataset of about 50 scans, but experience
>>>>>> > crashes with several that I am trying to track down. However I am out of
>>>>>> > ideas. The problem images are slightly bigger than some of the successful
>>>>>> > ones, but not substantially so, and I have experimented on machines with
>>>>>> > sufficient RAM. The error happens very quickly, as part of setup - see the
>>>>>> > valgrind report below. I haven't managed to get the sgcheck tool to work
>>>>>> > yet. I can only guess that the ksp object is somehow becoming corrupted
>>>>>> > during the setup process, but the array sizes that I can track (which
>>>>>> > derive from image sizes), appear correct at every point I can check. Any
>>>>>> > suggestions as to how I can check what might go wrong in the setup of the
>>>>>> > ksp object?
>>>>>> > Thankyou.
>>>>>> >
>>>>>> > valgrind tells me:
>>>>>> >
>>>>>> > ==18175== Argument 'size' of function memalign has a fishy (possibly negative) value: -17152038144
>>>>>> > ==18175==    at 0x4C320A6: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
>>>>>> > ==18175==    by 0x4F0F1F2: PetscMallocAlign(unsigned long, int, char const*, char const*, void**) (mal.c:28)
>>>>>> > ==18175==    by 0x56B43CA: MatSeqAIJSetPreallocation_SeqAIJ (aij.c:3595)
>>>>>> > ==18175==    by 0x56B39BD: MatSeqAIJSetPreallocation (aij.c:3539)
>>>>>> > ==18175==    by 0x59A9B44: DMCreateMatrix_DA_3d_MPIAIJ(_p_DM*, _p_Mat*) (fdda.c:1085)
>>>>>> > ==18175==    by 0x59A4C71: DMCreateMatrix_DA(_p_DM*, _p_Mat**) (fdda.c:759)
>>>>>> > ==18175==    by 0x58BBD29: DMCreateMatrix (dm.c:956)
>>>>>> > ==18175==    by 0x5E509D5: KSPSetUp (itfunc.c:262)
>>>>>> > ==18175==    by 0x40A3DE: PetscAdLemTaras3D::solveModel(bool) (PetscAdLemTaras3D.hxx:269)
>>>>>> > ==18175==    by 0x42413F: AdLem3D<3u>::solveModel(bool, bool, bool) (AdLem3D.hxx:552)
>>>>>> > ==18175==    by 0x41C25C: main (PetscAdLemMain.cxx:349)
>>>>>> > ==18175==
>>>>>> >
>>>>>> > --
>>>>>> > --
>>>>>> > A/Prof Richard Beare
>>>>>> > Imaging and Bioinformatics, Peninsula Clinical School
>>>>>> > orcid.org/0000-0002-7530-5664
>>>>>> > richard.be...@monash.edu
>>>>>> > +61 3 9788 1724
>>>>>> >
>>>>>> > Geospatial Research: https://www.monash.edu/medicine/scs/medicine/research/geospatial-analysis