Hi Barry,

I'm using PETSc version 3.20.3. The TACC system is Lonestar6.

Best,
Anna
________________________________
From: Barry Smith <[email protected]>
Sent: Thursday, January 18, 2024 4:43 PM
To: Yesypenko, Anna <[email protected]>
Cc: [email protected] <[email protected]>; Victor Eijkhout 
<[email protected]>
Subject: Re: [petsc-users] HashMap Error when populating AIJCUSPARSE matrix


   OK, I ran both versions of the Python code on an ANL machine with CUDA, and 
they worked fine for many runs. I even increased the problem size without 
producing any problems.

   Anna,

   What version of PETSc are you using?

   Victor,

   Does anyone at ANL have access to this TACC system to try to reproduce?


  Barry



On Jan 18, 2024, at 4:38 PM, Barry Smith <[email protected]> wrote:


   It is using the hash map system for inserting values, which only inserts on 
the CPU, not on the GPU. So I don't see that it would be moving any data to the 
GPU until the matrix assembly is done, which it never gets to. Hence I have 
trouble understanding why the GPU has anything to do with the crash.

   I guess I need to try to reproduce it on a GPU system.

   Barry




On Jan 18, 2024, at 4:28 PM, Matthew Knepley <[email protected]> wrote:

On Thu, Jan 18, 2024 at 4:18 PM Yesypenko, Anna 
<[email protected]> wrote:
Hi Matt, Barry,

Apologies for the extra dependency on scipy. I can replicate the error by 
calling setValue(i, j, v) in a loop as well.
In roughly half of 10 runs, the following script fails because of an error in 
hashmapijv, the same error as in my original post.
The other runs complete without error.

Barry is right that it's CUDA specific. The script runs fine on the CPU.
Do you have any suggestions or example scripts on assigning entries to an 
AIJCUSPARSE matrix?

Oh, you definitely do not want to be doing this. I believe you would rather

1) Make the CPU matrix and then convert to AIJCUSPARSE. This is efficient.

2) Produce the values on the GPU and call

  https://petsc.org/main/manualpages/Mat/MatSetPreallocationCOO/
  https://petsc.org/main/manualpages/Mat/MatSetValuesCOO/

  This is what most people do who are forming matrices directly on the GPU.

What you are currently doing is incredibly inefficient, and I think accounts 
for you running out of memory.
It talks back and forth between the CPU and GPU.
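
For reference, the two options above might look roughly like this in petsc4py 
for the same tridiagonal matrix as in the snippet below. This is only a sketch 
under stated assumptions, not code from this thread: the helper names 
(tridiagonal_coo, assemble_on_cpu_then_convert, assemble_coo) are made up for 
illustration, it assumes a single process (COMM_SELF), and the petsc4py calls 
(Mat.convert, Mat.setPreallocationCOO, Mat.setValuesCOO) are used as 
documented for recent petsc4py releases and have not been run on Lonestar6. 
The COO arrays are built on the host here for simplicity.

```python
def tridiagonal_coo(n):
    """Return (rows, cols, vals) lists for the n x n tridiagonal [-1, 2, -1] matrix."""
    rows, cols, vals = [], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:  # drop the out-of-range neighbours of the first and last rows
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


def assemble_on_cpu_then_convert(n, gpu_type="aijcusparse"):
    """Option 1 (sketch): fill and assemble a CPU 'aij' matrix, then convert it."""
    from petsc4py import PETSc  # imported lazily so the helper above has no PETSc dependency

    rows, cols, vals = tridiagonal_coo(n)
    A = PETSc.Mat().createAIJ(size=[n, n], nnz=3, comm=PETSc.COMM_SELF)
    for i, j, v in zip(rows, cols, vals):
        A.setValue(i, j, v)
    A.assemble()
    # With no 'out' argument the conversion is done in place (assumption based on the petsc4py docs).
    return A.convert(gpu_type)


def assemble_coo(n, mat_type="aijcusparse"):
    """Option 2 (sketch): give PETSc the whole pattern once, then all values in one call."""
    import numpy as np
    from petsc4py import PETSc

    rows, cols, vals = tridiagonal_coo(n)
    A = PETSc.Mat().create(comm=PETSc.COMM_SELF)
    A.setSizes([n, n])
    A.setType(mat_type)
    A.setPreallocationCOO(
        np.asarray(rows, dtype=PETSc.IntType),
        np.asarray(cols, dtype=PETSc.IntType),
    )
    A.setValuesCOO(
        np.asarray(vals, dtype=PETSc.ScalarType),
        addv=PETSc.InsertMode.INSERT_VALUES,
    )
    return A
```

With a CUDA-enabled PETSc build, something like A = assemble_coo(int(5e5)) 
would replace the setValue loop; passing mat_type='aij' should exercise the 
same path on the CPU.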

  Thanks,

     Matt

Here is a minimal snippet that doesn't depend on scipy.
```
from petsc4py import PETSc
import numpy as np

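n = int(5e5)
# Per-row nonzero counts for preallocation: 3 per interior row, 2 for the first and last rows.
nnz = 3 * np.ones(n, dtype=np.int32)
nnz[0] = nnz[-1] = 2
A = PETSc.Mat(comm=PETSc.COMM_WORLD)
A.createAIJ(size=[n, n], comm=PETSc.COMM_WORLD, nnz=nnz)
A.setType('aijcusparse')

# First and last rows of the tridiagonal [-1, 2, -1] matrix.
A.setValue(0, 0, 2)
A.setValue(0, 1, -1)
A.setValue(n - 1, n - 2, -1)
A.setValue(n - 1, n - 1, 2)

# Interior rows, one entry per setValue call.
for index in range(1, n - 1):
    A.setValue(index, index - 1, -1)
    A.setValue(index, index, 2)
    A.setValue(index, index + 1, -1)
A.assemble()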
```
If it means anything to you, when the hash error occurs, it is for index 67283 
after filling 201851 nonzero values.

Thank you for your help and suggestions!
Anna

________________________________
From: Barry Smith <[email protected]>
Sent: Thursday, January 18, 2024 2:35 PM
To: Yesypenko, Anna <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [petsc-users] HashMap Error when populating AIJCUSPARSE matrix


   Do you ever get a problem with 'aij'? Can you run in a loop with 'aij' to 
confirm it doesn't fail then?



   Barry


On Jan 17, 2024, at 4:51 PM, Yesypenko, Anna 
<[email protected]> wrote:

Dear PETSc users/developers,

I'm experiencing a bug when using petsc4py with GPU support. It may be my 
mistake in how I set up an AIJCUSPARSE matrix.
For larger matrices, I sometimes encounter an error in assigning matrix values; 
the error is thrown in PetscHMapIJVQuerySet().
Here is a minimal snippet that populates a sparse tridiagonal matrix.

```
from petsc4py import PETSc
from scipy.sparse import diags
import numpy as np

n = int(5e5)

nnz = 3 * np.ones(n, dtype=np.int32)
nnz[0] = nnz[-1] = 2
A = PETSc.Mat(comm=PETSc.COMM_WORLD)
A.createAIJ(size=[n, n], comm=PETSc.COMM_WORLD, nnz=nnz)
A.setType('aijcusparse')
tmp = diags([-1, 2, -1], [-1, 0, +1], shape=(n, n)).tocsr()
A.setValuesCSR(tmp.indptr, tmp.indices, tmp.data)  # this is the line where the error is thrown
A.assemble()
```

The error trace is below:
```
  File "petsc4py/PETSc/Mat.pyx", line 2603, in petsc4py.PETSc.Mat.setValuesCSR
  File "petsc4py/PETSc/petscmat.pxi", line 1039, in petsc4py.PETSc.matsetvalues_csr
  File "petsc4py/PETSc/petscmat.pxi", line 1032, in petsc4py.PETSc.matsetvalues_ijv
petsc4py.PETSc.Error: error code 76
[0] MatSetValues() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:1497
[0] MatSetValues_Seq_Hash() at /work/06368/annayesy/ls6/petsc/include/../src/mat/impls/aij/seq/seqhashmatsetvalues.h:52
[0] PetscHMapIJVQuerySet() at /work/06368/annayesy/ls6/petsc/include/petsc/private/hashmapijv.h:10
[0] Error in external library
[0] [khash] Assertion: `ret >= 0' failed.
```

If I run the same script a handful of times, it will run without errors 
eventually.
Does anyone have insight on why it is behaving this way? I'm running on a node 
with 3x NVIDIA A100 PCIE 40GB.

Thank you!
Anna



--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

