Actually, here's a short test case I just made. I have it on a git repo: https://github.com/s769/petsc-test
I put some instructions for how to build and run, but if there are issues, please let me know. In this small test code, I noticed that there are some CUDA memory errors in the VecAXPY() line if the proc_cols variable is not 1. Still trying to figure out what might be causing that, but in the meantime, the code I have up there hangs for proc_rows=3, proc_cols=1, n=10 when we try to get the norm of the Vec.

Hope this helps.

Thanks,
Sreeram

On Thu, Nov 16, 2023 at 8:38 PM Sreeram R Venkat <[email protected]> wrote:

> Ok, will do. It may take me a few days to get a minimal reproducible
> example though since the rest of the program has gotten quite large.
>
> Thanks,
> Sreeram
>
> On Thu, Nov 16, 2023 at 8:27 PM Matthew Knepley <[email protected]> wrote:
>
>> On Thu, Nov 16, 2023 at 6:19 PM Sreeram R Venkat <[email protected]>
>> wrote:
>>
>>> I have a program which reads a vector from file into an array, and then
>>> uses that array to create a PETSc Vec object. The Vec is defined on the
>>> global communicator, but not all processes actually contain entries of it.
>>> For example, suppose we have 4 processors, and the vector is of size 10.
>>> Rank 0 will contain entries 0-4 and Rank 1 will contain entries 5-9. Ranks
>>> 2 and 3 will not have any entries of the Vec.
>>>
>>> This Vec is then used as an input to other parts of the code, and those
>>> work fine. However, if I try to take the norm of the Vec with VecNorm(), I
>>> get the error
>>>
>>> `MPI_Allreduce() called in different locations (code lines) on different
>>> processors`
>>>
>>> The stack trace shows that ranks 0 and 1 (from the above example) are
>>> still in the VecNorm() function while ranks 2 and 3 have moved on to a
>>> later part of the code. If I add a PetscBarrier() after the VecNorm(), I
>>> find that the program hangs.
>>>
>>> The funny thing is that part of the code duplicates the Vec with
>>> VecDuplicate() and assigns to the duplicated vector the result of some
>>> computations. The duplicated Vec has the same layout as the original Vec,
>>> but taking VecNorm() on the duplicated Vec works fine. If I use VecCopy(),
>>> however, the copied Vec also causes VecNorm() to hang. I've printed out the
>>> original Vec, and there are no corrupted/NaN entries.
>>>
>>> I have a temporary workaround where I perturb the original Vec slightly
>>> before copying it to another Vec. This causes the program to successfully
>>> terminate.
>>>
>>> Any advice on how to get VecNorm() working with the original Vec?
>>>
>>
>> Vecs with empty layouts work fine, so it must be something else about how
>> it is created.
>>
>> In order to track it down, I would first make a short program that just
>> creates the Vec as you say and see if it hangs. If so, just send it and we
>> will debug it. If not, I would systematically cut down your program until
>> you get something that hangs that you can send to us.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> Thanks,
>>> Sreeram
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
