I have pushed a new branch, barry/reduce-dmsetup-da-memoryusage, to the PETSc repository; it cuts to one third the amount of memory of order dof * (number of local grid points) allocated in DMSetUp(). With this change you should see a pretty good improvement in "wasted" memory.
See https://bitbucket.org/petsc/petsc/wiki/Home for accessing the branch.

   Barry

On Oct 22, 2013, at 3:57 AM, Juha Jäykkä <[email protected]> wrote:

> Barry,
>
> I seem to have touched a topic which goes way past my knowledge of PETSc
> internals, but it's very nice to see a thorough response nevertheless. Thank
> you. And Matthew, too.
>
> After reading your suspicions about the number of ranks, I tried with 1, 2 and 4,
> and the memory use indeed seems to go down from 1:
>
> juhaj@dhcp071> CMD='import helpers;
> procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> from petsc4py import PETSc;
> procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> da = PETSc.DA().create(sizes=[100,100,100],
>     proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE], boundary_type=[3,0,0],
>     stencil_type=PETSc.DA.StencilType.BOX, dof=7, stencil_width=1,
>     comm=PETSc.COMM_WORLD);
> procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]'
> juhaj@dhcp071> mpirun -np 1 python -c "$CMD"
> 21 MiB / 22280 kB
> 21 MiB / 22304 kB
> 354 MiB / 419176 kB
> juhaj@dhcp071> mpirun -np 2 python -c "$CMD"
> 22 MiB / 23276 kB
> 22 MiB / 23020 kB
> 22 MiB / 23300 kB
> 22 MiB / 23044 kB
> 141 MiB / 145324 kB
> 141 MiB / 145068 kB
> juhaj@dhcp071> mpirun -np 4 python -c "$CMD"
> 22 MiB / 23292 kB
> 22 MiB / 23036 kB
> 22 MiB / 23316 kB
> 22 MiB / 23060 kB
> 22 MiB / 23316 kB
> 22 MiB / 23340 kB
> 22 MiB / 23044 kB
> 22 MiB / 23068 kB
> 81 MiB / 83716 kB
> 82 MiB / 83976 kB
> 81 MiB / 83964 kB
> 81 MiB / 83724 kB
>
> As one would expect, 4 ranks need more memory than 2 ranks, but quite
> unexpectedly, 1 rank needs more than 2! I guess you are right: the 1-rank case
> is not optimised, and quite frankly, I don't mind: I only ever run small tests
> with one rank. Unfortunately, trying to create the simplest possible scenario
> to illustrate my point, I used a small DA and just one rank, precisely to
> avoid the case where the excess memory would be due to MPI buffers or such.
> Looks like my plan backfired. ;)
>
> But even so, my 53 MiB lattice, without any vectors created, takes 280 or
> 320 MiB of memory, down to less than 6 times the data size from the original 6.6.
>
> I will test with 3.3 later today if I have the time, but I'm pretty sure
> things were "better" there.
>
> On Monday 21 October 2013 15:23:01 Barry Smith wrote:
>> Matt,
>>
>>   I think you are running on 1 process, where the DMDA doesn't have an
>> optimized path; when I run on 2 processes the numbers indicate nothing
>> proportional to dof * number of local points.
>>
>> dof = 12
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 21344 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 182480 VecScatterCreate_PtoS()
>>
>> dof = 8
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 21344 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 176080 VecScatterCreate_PtoS()
>>
>> dof = 4
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 21344 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 169680 VecScatterCreate_PtoS()
>>
>> dof = 2
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 21344 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 166480 VecScatterCreate_PtoS()
>>
>> dof = 2, grid is 50 by 50 instead of 100 by 100
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 6352 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 43952 VecScatterCreate_PtoS()
>>
>> The IS creation in the DMDA is far more troubling:
>>
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>>
>> dof = 2
>>
>> [0] 1 20400 ISBlockSetIndices_Block()
>> [0] 15 3760 ISCreate()
>> [0] 4 128 ISCreate_Block()
>> [0] 1 16 ISCreate_Stride()
>> [0] 2 81600 ISGetIndices_Block()
>> [0] 1 20400 ISLocalToGlobalMappingBlock()
>> [0] 7 42016 ISLocalToGlobalMappingCreate()
>>
>> dof = 4
>>
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>> [0] 1 20400 ISBlockSetIndices_Block()
>> [0] 15 3760 ISCreate()
>> [0] 4 128 ISCreate_Block()
>> [0] 1 16 ISCreate_Stride()
>> [0] 2 163200 ISGetIndices_Block()
>> [0] 1 20400 ISLocalToGlobalMappingBlock()
>> [0] 7 82816 ISLocalToGlobalMappingCreate()
>>
>> dof = 8
>>
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>> [0] 1 20400 ISBlockSetIndices_Block()
>> [0] 15 3760 ISCreate()
>> [0] 4 128 ISCreate_Block()
>> [0] 1 16 ISCreate_Stride()
>> [0] 2 326400 ISGetIndices_Block()
>> [0] 1 20400 ISLocalToGlobalMappingBlock()
>> [0] 7 164416 ISLocalToGlobalMappingCreate()
>>
>> dof = 12
>>
>> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>> [0] 1 20400 ISBlockSetIndices_Block()
>> [0] 15 3760 ISCreate()
>> [0] 4 128 ISCreate_Block()
>> [0] 1 16 ISCreate_Stride()
>> [0] 2 489600 ISGetIndices_Block()
>> [0] 1 20400 ISLocalToGlobalMappingBlock()
>> [0] 7 246016 ISLocalToGlobalMappingCreate()
>>
>> Here the accessing of indices is at the point level (as well as the block level),
>> and hence memory usage is proportional to dof * local number of grid points. Of
>> course it is still only proportional to the vector size. There is some
>> improvement we could make here: with a lot of refactoring we can remove the
>> dof* completely, and with a little refactoring we can bring it down to a single
>> dof * local number of grid points.
>>
>> I cannot understand why you are seeing memory usage 7 times more than a
>> vector. That seems like a lot.
>>
>>   Barry
>>
>> On Oct 21, 2013, at 11:32 AM, Barry Smith <[email protected]> wrote:
>>> The PETSc DMDA object greedily allocates several arrays of data used to
>>> set up the communication and other things like local-to-global mappings,
>>> even before you create any vectors. This is why you see this big bump
>>> in memory usage.
>>>
>>> BUT I don't think it should be any worse in 3.4 than in 3.3 or earlier;
>>> at least we did not intend to make it worse. Are you sure it is using
>>> more memory than in 3.3?
>>>
>>> In order for us to decrease the memory usage of the DMDA setup it would
>>> be helpful if we knew which objects created within it used the most
>>> memory. There is some sloppiness in that routine of not reusing memory
>>> as well as it could be; I'm not sure how much difference that would make.
>>>
>>>   Barry
>>>
>>> On Oct 21, 2013, at 7:02 AM, Juha Jäykkä <[email protected]> wrote:
>>>> Dear list members,
>>>>
>>>> I have noticed strange memory consumption after upgrading to the 3.4 series.
>>>> I never had time to properly investigate, but here is what happens [yes,
>>>> this might be a petsc4py issue, but I doubt it]:
>>>>
>>>> # helpers contains the _ProcessMemoryInfoProc routine which just digs the
>>>> # memory usage data from /proc
>>>> import helpers
>>>> procdata=helpers._ProcessMemoryInfoProc()
>>>> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
>>>> from petsc4py import PETSc
>>>> procdata=helpers._ProcessMemoryInfoProc()
>>>> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
>>>> da = PETSc.DA().create(sizes=[100,100,100],
>>>>                        proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
>>>>                        boundary_type=[3,0,0],
>>>>                        stencil_type=PETSc.DA.StencilType.BOX,
>>>>                        dof=7, stencil_width=1, comm=PETSc.COMM_WORLD)
>>>> procdata=helpers._ProcessMemoryInfoProc()
>>>> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
>>>> vec=da.createGlobalVec()
>>>> procdata=helpers._ProcessMemoryInfoProc()
>>>> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
>>>>
>>>> outputs
>>>>
>>>> 48 MiB / 49348 kB
>>>> 48 MiB / 49360 kB
>>>> 381 MiB / 446228 kB
>>>> 435 MiB / 446228 kB
>>>>
>>>> Which is odd: the size of the actual data to be stored in the da is just
>>>> about 56 megabytes, so why does creating the da consume 7 times that?
>>>> And why does the DA reserve the memory in the first place? I thought
>>>> memory only gets allocated once an associated vector is created, and it
>>>> indeed looks like the createGlobalVec call allocates the right amount of
>>>> data. But what is that 330 MiB that DA().create() consumes? [It's actually
>>>> the .setUp() method that does the consuming, but that's not of much use
>>>> as it needs to be called before a vector can be created.]
>>>>
>>>> Cheers,
>>>> Juha
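For reference, a quick back-of-the-envelope check of the sizes discussed in this thread, assuming 8-byte (double-precision) PetscScalars and the 100 x 100 x 100, dof = 7 grid used above:

nx = ny = nz = 100
dof = 7
bytes_per_scalar = 8                   # assuming double-precision PetscScalar
payload = nx * ny * nz * dof * bytes_per_scalar
print(payload / 1e6, "MB")             # 56.0 MB  -- the "about 56 megabytes"
print(payload / 2**20, "MiB")          # ~53.4 MiB -- the "53 MiB lattice"

The roughly 330-390 MiB of growth observed during DA().create() (really DMSetUp()) on a single rank is therefore about 6-7 times the payload, which is the overhead discussed above and the target of the barry/reduce-dmsetup-da-memoryusage branch.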
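The measurement scripts quoted in this thread depend on a private helpers module. A minimal, self-contained sketch of the same measurement is below; rss_kb() is a hypothetical stand-in for helpers._ProcessMemoryInfoProc that parses VmRSS from /proc/self/status (so it is Linux-only), and the DA parameters simply mirror the 100 x 100 x 100, dof = 7 example:

def rss_kb():
    # Resident set size in kB, read from /proc/self/status (Linux only).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return -1

print("baseline:               ", rss_kb(), "kB")

from petsc4py import PETSc
print("after petsc4py import:  ", rss_kb(), "kB")

da = PETSc.DA().create(sizes=[100, 100, 100],
                       proc_sizes=[PETSc.DECIDE, PETSc.DECIDE, PETSc.DECIDE],
                       boundary_type=[3, 0, 0],
                       stencil_type=PETSc.DA.StencilType.BOX,
                       dof=7, stencil_width=1, comm=PETSc.COMM_WORLD)
print("after DA().create():    ", rss_kb(), "kB")

vec = da.createGlobalVec()
print("after createGlobalVec():", rss_kb(), "kB")

Run with, for example, mpirun -np 2 python memtest.py (the file name is arbitrary); each rank prints its own numbers, much as in the runs quoted above.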
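Barry's per-function numbers come from running a small C test (ex1, not shown in the thread) with -malloc_log and grepping the output. A similar probe can be sketched from the Python side by passing the option before PETSc is initialized; this is only a sketch, and on newer PETSc releases the option is, I believe, spelled -malloc_view rather than -malloc_log:

import sys
import petsc4py
petsc4py.init(sys.argv + ["-malloc_log"])  # must happen before "from petsc4py import PETSc"
from petsc4py import PETSc

dof = PETSc.Options().getInt("dof", 2)     # vary dof on the command line, e.g. -dof 8
da = PETSc.DA().create(sizes=[100, 100], dof=dof, stencil_width=1,
                       comm=PETSc.COMM_WORLD)
# The per-function allocation log (VecScatterCreate_PtoS(), ISGetIndices_Block(), ...)
# is printed when PETSc is finalized at interpreter exit, so it can be grepped as above.

For example: mpirun -np 2 python probe.py -dof 8 | grep VecScatter (again, the file name is arbitrary).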
