Thanks. Same version I tried.
> On Jan 18, 2024, at 6:09 PM, Yesypenko, Anna <[email protected]> wrote:
>
> Hi Barry,
>
> I'm using version 3.20.3. The tacc system is lonestar6.
>
> Best,
> Anna
> From: Barry Smith <[email protected]>
> Sent: Thursday, January 18, 2024 4:43 PM
> To: Yesypenko, Anna <[email protected]>
> Cc: [email protected] <[email protected]>; Victor Eijkhout <[email protected]>
> Subject: Re: [petsc-users] HashMap Error when populating AIJCUSPARSE matrix
>
> Ok, I ran it on an ANL machine with CUDA and it worked fine for many runs,
> even increased the problem size without producing any problems. Both versions
> of the Python code.
>
> Anna,
>
> What version of PETSc are you using?
>
> Victor,
>
> Does anyone at ANL have access to this TACC system to try to reproduce?
>
> Barry
>
>> On Jan 18, 2024, at 4:38 PM, Barry Smith <[email protected]> wrote:
>>
>> It is using the hash map system for inserting values which only inserts
>> on the CPU, not on the GPU. So I don't see that it would be moving any data
>> to the GPU until the mat assembly() is done which it never gets to. Hence I
>> have trouble understanding why the GPU has anything to do with the crash.
>>
>> I guess I need to try to reproduce it on a GPU system.
>>
>> Barry
>>
>>> On Jan 18, 2024, at 4:28 PM, Matthew Knepley <[email protected]> wrote:
>>>
>>> On Thu, Jan 18, 2024 at 4:18 PM Yesypenko, Anna <[email protected]> wrote:
>>> Hi Matt, Barry,
>>>
>>> Apologies for the extra dependency on scipy. I can replicate the error by
>>> calling setValue (i,j,v) in a loop as well.
>>> In roughly half of 10 runs, the following script fails because of an error
>>> in hashmapijv – the same as my original post.
>>> It successfully runs without error the other times.
>>>
>>> Barry is right that it's CUDA specific. The script runs fine on the CPU.
>>> Do you have any suggestions or example scripts on assigning entries to an
>>> AIJCUSPARSE matrix?
>>>
>>> Oh, you definitely do not want to be doing this. I believe you would rather
>>>
>>> 1) Make the CPU matrix and then convert to AIJCUSPARSE. This is efficient.
>>>
>>> 2) Produce the values on the GPU and call
>>>
>>> https://petsc.org/main/manualpages/Mat/MatSetPreallocationCOO/
>>> https://petsc.org/main/manualpages/Mat/MatSetValuesCOO/
>>>
>>> This is what most people do who are forming matrices directly on the GPU.
>>>
>>> What you are currently doing is incredibly inefficient, and I think
>>> accounts for you running out of memory.
>>> It talks back and forth between the CPU and GPU.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>> Here is a minimum snippet that doesn't depend on scipy.
>>> ```
>>> from petsc4py import PETSc
>>> import numpy as np
>>>
>>> n = int(5e5);
>>> nnz = 3 * np.ones(n, dtype=np.int32)
>>> nnz[0] = nnz[-1] = 2
>>> A = PETSc.Mat(comm=PETSc.COMM_WORLD)
>>> A.createAIJ(size=[n,n],comm=PETSc.COMM_WORLD,nnz=nnz)
>>> A.setType('aijcusparse')
>>>
>>> A.setValue(0, 0, 2)
>>> A.setValue(0, 1, -1)
>>> A.setValue(n-1, n-2, -1)
>>> A.setValue(n-1, n-1, 2)
>>>
>>> for index in range(1, n - 1):
>>>     A.setValue(index, index - 1, -1)
>>>     A.setValue(index, index, 2)
>>>     A.setValue(index, index + 1, -1)
>>> A.assemble()
>>> ```
>>> If it means anything to you, when the hash error occurs, it is for index
>>> 67283 after filling 201851 nonzero values.
>>>
>>> Thank you for your help and suggestions!
>>> Anna
>>>
>>> From: Barry Smith <[email protected]>
>>> Sent: Thursday, January 18, 2024 2:35 PM
>>> To: Yesypenko, Anna <[email protected]>
>>> Cc: [email protected] <[email protected]>
>>> Subject: Re: [petsc-users] HashMap Error when populating AIJCUSPARSE matrix
>>>
>>> Do you ever get a problem with 'aij'? Can you run in a loop with
>>> 'aij' to confirm it doesn't fail then?
>>>
>>> Barry
>>>
>>>> On Jan 17, 2024, at 4:51 PM, Yesypenko, Anna <[email protected]> wrote:
>>>>
>>>> Dear PETSc users/developers,
>>>>
>>>> I'm experiencing a bug when using petsc4py with GPU support. It may be my
>>>> mistake in how I set up an AIJCUSPARSE matrix.
>>>> For larger matrices, I sometimes encounter an error in assigning matrix
>>>> values; the error is thrown in PetscHMapIJVQuerySet().
>>>> Here is a minimum snippet that populates a sparse tridiagonal matrix.
>>>>
>>>> ```
>>>> from petsc4py import PETSc
>>>> from scipy.sparse import diags
>>>> import numpy as np
>>>>
>>>> n = int(5e5);
>>>>
>>>> nnz = 3 * np.ones(n, dtype=np.int32); nnz[0] = nnz[-1] = 2
>>>> A = PETSc.Mat(comm=PETSc.COMM_WORLD)
>>>> A.createAIJ(size=[n,n],comm=PETSc.COMM_WORLD,nnz=nnz)
>>>> A.setType('aijcusparse')
>>>> tmp = diags([-1,2,-1],[-1,0,+1],shape=(n,n)).tocsr()
>>>> A.setValuesCSR(tmp.indptr,tmp.indices,tmp.data)
>>>> ####### this is the line where the error is thrown.
>>>> A.assemble()
>>>> ```
>>>>
>>>> The error trace is below:
>>>> ```
>>>> File "petsc4py/PETSc/Mat.pyx", line 2603, in petsc4py.PETSc.Mat.setValuesCSR
>>>> File "petsc4py/PETSc/petscmat.pxi", line 1039, in petsc4py.PETSc.matsetvalues_csr
>>>> File "petsc4py/PETSc/petscmat.pxi", line 1032, in petsc4py.PETSc.matsetvalues_ijv
>>>> petsc4py.PETSc.Error: error code 76
>>>> [0] MatSetValues() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:1497
>>>> [0] MatSetValues_Seq_Hash() at /work/06368/annayesy/ls6/petsc/include/../src/mat/impls/aij/seq/seqhashmatsetvalues.h:52
>>>> [0] PetscHMapIJVQuerySet() at /work/06368/annayesy/ls6/petsc/include/petsc/private/hashmapijv.h:10
>>>> [0] Error in external library
>>>> [0] [khash] Assertion: `ret >= 0' failed.
>>>> ```
>>>>
>>>> If I run the same script a handful of times, it will run without errors
>>>> eventually.
>>>> Does anyone have insight on why it is behaving this way? I'm running on a
>>>> node with 3x NVIDIA A100 PCIE 40GB.
>>>>
>>>> Thank you!
>>>> Anna
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
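
For reference, Matt's option 1 from the thread (make the CPU matrix, then convert to AIJCUSPARSE) could look roughly like the sketch below for the same tridiagonal test matrix. It has not been run on lonestar6 and is not a confirmed fix for the hash map error. It assumes a single process (PETSc.COMM_SELF) and a CUDA-enabled PETSc build. The helper names tridiagonal_csr and build_on_cpu_then_convert are made up for illustration and are not part of petsc4py.

```
def tridiagonal_csr(n):
    """CSR arrays (indptr, indices, data) for the n x n matrix tridiag(-1, 2, -1)."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:  # drop the off-diagonals that fall outside the matrix
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


def build_on_cpu_then_convert(n, gpu_type="aijcusparse"):
    """Assemble as plain 'aij' on the host, then convert once (single process)."""
    # Imported here so the CSR helper above can be used without PETSc installed.
    import numpy as np
    from petsc4py import PETSc

    indptr, indices, data = tridiagonal_csr(n)
    csr = (
        np.asarray(indptr, dtype=PETSc.IntType),
        np.asarray(indices, dtype=PETSc.IntType),
        np.asarray(data, dtype=PETSc.ScalarType),
    )
    # With csr given, createAIJ preallocates and fills the CPU matrix directly,
    # so no per-entry setValue() calls are needed.
    A = PETSc.Mat().createAIJ(size=(n, n), csr=csr, comm=PETSc.COMM_SELF)
    A.assemble()
    # Convert only after assembly; assumption: PETSc was configured with CUDA.
    return A.convert(gpu_type)
```

Calling build_on_cpu_then_convert(int(5e5)) would mirror the problem size used in the scripts above. The matrix type is set only after the values are in place, so the data moves to the GPU format once rather than during insertion.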
