What's your minimal reproducer?

"Mills, Richard Tran via petsc-dev" <[email protected]> writes:

> On 2/21/19 1:48 PM, Smith, Barry F. wrote:
>
>
>
>
>
> On Feb 20, 2019, at 12:50 PM, Mills, Richard Tran via petsc-dev 
> <[email protected]><mailto:[email protected]> wrote:
>
> Folks,
>
> I'm working with some PFLOTRAN examples, and I've been scratching my head for 
> a while at errors that are being generated in MatSetValuesBlockedLocal(), 
> related to new nonzeros requiring a malloc. E.g.,
>
> [0]PETSC ERROR: Argument out of range
> [0]PETSC ERROR: New nonzero at (2,0) caused a malloc
>
> The strange thing is that all of these examples run fine when I use the 
> PFLOTRAN default BAIJ matrix type (with blocks column-oriented) -- I only see 
> the issue when I switch to using AIJ. If I set the matrix option 
> MAT_NEW_NONZERO_ALLOCATION_ERR to PETSC_FALSE to allow things to run with 
> AIJ, everything seems to work perfectly. (And once the Jacobian matrix is 
> constructed the first time, the memory usage is not growing according to what 
> I see from -malloc_log.)
>
> I've spent some time poking around with a debugger in the variants of 
> MatSetValues() and MatXAIJSetPreallocation(), and I haven't been able make 
> much progress in determining why I'm seeing this in AIJ vs. BAIJ -- it's easy 
> to get confused looking at this code, and it's going to take me more time to 
> follow what is going on than I've had to spare so far. So, I've got a few 
> questions. First, should it even be possible to see what I am seeing? That 
> is, if the MatSetValuesBlocked() routine is not causing new allocations when 
> using BAIJ, should this be possible with AIJ?
>
>
>
>
>     Sure, because the preallocation for BAIJ matrix is by block, while for 
> AIJ it is by point (even if you provide a block size to AIJ). So long as the 
> new entry is within an allocated block it will not trigger an error with BAIJ 
> but may with AIJ.
>
> Hmm. It seems that I've misunderstood what MatXAIJSetPreallocation() is 
> supposed to do. I thought that if I pass in a blocksize > 1, 
> MatXAIJSetPreallocation() will turn the block-row preallocation into the 
> equivalent scalar-row stuff. This seems to be what is happening in the code, 
> unless I'm reading this wrong:
>
> [...]
> 304:     } else {                    /* Convert block-row precallocation to 
> scalar-row */
> 305:       PetscInt i,m,*sdnnz,*sonnz;
> 306:       MatGetLocalSize(A,&m,NULL);
> 307:       PetscMalloc2((!!dnnz)*m,&sdnnz,(!!onnz)*m,&sonnz);
> 308:       for (i=0; i<m; i++) {
> 309:         if (dnnz) sdnnz[i] = dnnz[i/bs] * bs;
> 310:         if (onnz) sonnz[i] = onnz[i/bs] * bs;
> 311:       }
> 312:       MatSeqAIJSetPreallocation(A,0,dnnz ? sdnnz : NULL);
> 313:       MatMPIAIJSetPreallocation(A,0,dnnz ? sdnnz : NULL,0,onnz ? sonnz : 
> NULL);
> [...]
>
> Though might something be happening later that reduces the amount of nonzeros 
> allocated, since the "dense" blocks of the matrix actually will have zero 
> entries that never get a value set?
>
> Something does appear to be screwy when I look at the "-snes_view" output for 
> a time-dependent problem where I see these errors. When using BAIJ, at every 
> time step I see
>
>     Mat Object: Mat_0x84000000_0 (flow_) 1 MPI processes
>       type: seqbaij
>       rows=144, cols=144, bs=3
>       total: nonzeros=2304, allocated nonzeros=2304
>       total number of mallocs used during MatSetValues calls =0
>           block size is 3
>
> but when I switch to AIJ, I initially see
>
>     Mat Object: Mat_0x84000000_0 (flow_) 1 MPI processes
>       type: seqaij
>       rows=144, cols=144, bs=3
>       total: nonzeros=1584, allocated nonzeros=3324
>       total number of mallocs used during MatSetValues calls =68
>         using I-node routines: found 96 nodes, limit used is 5
>
> but as things progress, the allocated nonzeros and mallocs count keeps 
> growing; by step 43 I'm seeing
>
>       total: nonzeros=1584, allocated nonzeros=115524
>       total number of mallocs used during MatSetValues calls =7548
>
> If I dump the matrices and look at them in something like Octave, the 
> matrices from the BAIJ and AIJ runs appear identical within round-off error, 
> and we certainly aren't adding any entries at each step. Why should PETSc 
> keep calling malloc() at each MatSetValues() call, then? I suspect a bug 
> somewhere in the AIJ case call sequence of MatSetValuesBlockedLocal() --> 
> MatSetValuesBlocked() --> MatSetValues() --> MatSetValues_SeqAIJ(). I have 
> trouble following all of the logic in matsetvaluesseqaij_(), though, so I 
> guess I'm going to have try harder to understand why 
> MatSeqXAIJReallocateAIJ() is being hit there.
>
> Any educated guesses about what might be going wrong would be appreciated. 
> There is a lot of PETSc code I'm not familiar with involved here.
>
> Cheers,
> Richard
>
>
>
>
>
>    Barry
>
>
>
> (Clearly it *is* possible, but should it be?) I'm still trying to figure out 
> if there is something wrong with what PFLOTRAN is doing, vs. something going 
> wrong somewhere inside PETSc.
>
> Any hints about what to look at from someone with more familiarity with this 
> code would be appreciated.
>
> --Richard

Reply via email to