What's your minimal reproducer? "Mills, Richard Tran via petsc-dev" <[email protected]> writes:
> On 2/21/19 1:48 PM, Smith, Barry F. wrote: > > > > > > On Feb 20, 2019, at 12:50 PM, Mills, Richard Tran via petsc-dev > <[email protected]><mailto:[email protected]> wrote: > > Folks, > > I'm working with some PFLOTRAN examples, and I've been scratching my head for > a while at errors that are being generated in MatSetValuesBlockedLocal(), > related to new nonzeros requiring a malloc. E.g., > > [0]PETSC ERROR: Argument out of range > [0]PETSC ERROR: New nonzero at (2,0) caused a malloc > > The strange thing is that all of these examples run fine when I use the > PFLOTRAN default BAIJ matrix type (with blocks column-oriented) -- I only see > the issue when I switch to using AIJ. If I set the matrix option > MAT_NEW_NONZERO_ALLOCATION_ERR to PETSC_FALSE to allow things to run with > AIJ, everything seems to work perfectly. (And once the Jacobian matrix is > constructed the first time, the memory usage is not growing according to what > I see from -malloc_log.) > > I've spent some time poking around with a debugger in the variants of > MatSetValues() and MatXAIJSetPreallocation(), and I haven't been able make > much progress in determining why I'm seeing this in AIJ vs. BAIJ -- it's easy > to get confused looking at this code, and it's going to take me more time to > follow what is going on than I've had to spare so far. So, I've got a few > questions. First, should it even be possible to see what I am seeing? That > is, if the MatSetValuesBlocked() routine is not causing new allocations when > using BAIJ, should this be possible with AIJ? > > > > > Sure, because the preallocation for BAIJ matrix is by block, while for > AIJ it is by point (even if you provide a block size to AIJ). So long as the > new entry is within an allocated block it will not trigger an error with BAIJ > but may with AIJ. > > Hmm. It seems that I've misunderstood what MatXAIJSetPreallocation() is > supposed to do. I thought that if I pass in a blocksize > 1, > MatXAIJSetPreallocation() will turn the block-row preallocation into the > equivalent scalar-row stuff. This seems to be what is happening in the code, > unless I'm reading this wrong: > > [...] > 304: } else { /* Convert block-row precallocation to > scalar-row */ > 305: PetscInt i,m,*sdnnz,*sonnz; > 306: MatGetLocalSize(A,&m,NULL); > 307: PetscMalloc2((!!dnnz)*m,&sdnnz,(!!onnz)*m,&sonnz); > 308: for (i=0; i<m; i++) { > 309: if (dnnz) sdnnz[i] = dnnz[i/bs] * bs; > 310: if (onnz) sonnz[i] = onnz[i/bs] * bs; > 311: } > 312: MatSeqAIJSetPreallocation(A,0,dnnz ? sdnnz : NULL); > 313: MatMPIAIJSetPreallocation(A,0,dnnz ? sdnnz : NULL,0,onnz ? sonnz : > NULL); > [...] > > Though might something be happening later that reduces the amount of nonzeros > allocated, since the "dense" blocks of the matrix actually will have zero > entries that never get a value set? > > Something does appear to be screwy when I look at the "-snes_view" output for > a time-dependent problem where I see these errors. When using BAIJ, at every > time step I see > > Mat Object: Mat_0x84000000_0 (flow_) 1 MPI processes > type: seqbaij > rows=144, cols=144, bs=3 > total: nonzeros=2304, allocated nonzeros=2304 > total number of mallocs used during MatSetValues calls =0 > block size is 3 > > but when I switch to AIJ, I initially see > > Mat Object: Mat_0x84000000_0 (flow_) 1 MPI processes > type: seqaij > rows=144, cols=144, bs=3 > total: nonzeros=1584, allocated nonzeros=3324 > total number of mallocs used during MatSetValues calls =68 > using I-node routines: found 96 nodes, limit used is 5 > > but as things progress, the allocated nonzeros and mallocs count keeps > growing; by step 43 I'm seeing > > total: nonzeros=1584, allocated nonzeros=115524 > total number of mallocs used during MatSetValues calls =7548 > > If I dump the matrices and look at them in something like Octave, the > matrices from the BAIJ and AIJ runs appear identical within round-off error, > and we certainly aren't adding any entries at each step. Why should PETSc > keep calling malloc() at each MatSetValues() call, then? I suspect a bug > somewhere in the AIJ case call sequence of MatSetValuesBlockedLocal() --> > MatSetValuesBlocked() --> MatSetValues() --> MatSetValues_SeqAIJ(). I have > trouble following all of the logic in matsetvaluesseqaij_(), though, so I > guess I'm going to have try harder to understand why > MatSeqXAIJReallocateAIJ() is being hit there. > > Any educated guesses about what might be going wrong would be appreciated. > There is a lot of PETSc code I'm not familiar with involved here. > > Cheers, > Richard > > > > > > Barry > > > > (Clearly it *is* possible, but should it be?) I'm still trying to figure out > if there is something wrong with what PFLOTRAN is doing, vs. something going > wrong somewhere inside PETSc. > > Any hints about what to look at from someone with more familiarity with this > code would be appreciated. > > --Richard
