I seem to have had a classic deadlock: A was being assembled on some processes while the others were stuck elsewhere. Adding some barriers seems to fix the problem, at least for the cases I currently have.
What I still don't see is what the advantage of MPI_Barrier(((PetscObject)A)->comm) would be over MPI_Barrier(PETSC_COMM_WORLD).

Many thanks
Dominik

On Fri, Aug 26, 2011 at 11:01 AM, Matthew Knepley <knepley at gmail.com> wrote:
> On Fri, Aug 26, 2011 at 8:37 AM, Dominik Szczerba <dominik at itis.ethz.ch> wrote:
>>
>> > When you run in the debugger and break after it has obviously hung, are
>> > all processes stopped at the same place?
>>
>> Of course not, they are stuck at barriers elsewhere. Thanks for the
>> valuable question.
>>
>> > If you see an error condition, you can run
>> >   CHKMEMQ;
>> >   MPI_Barrier(((PetscObject)A)->comm);
>> >   MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
>> >   MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);
>> > If it hangs, check where every process is stuck.
>>
>> I obviously seem to be missing some barriers. But why would I need
>> MPI_Barrier(((PetscObject)A)->comm) and not just
>> MPI_Barrier(PETSC_COMM_WORLD)? Would that only force a barrier for
>> A-related traffic?
>
> The idea here is the following:
>   1) We would like to isolate the mismatch in synchronizations.
>   2) We can place barriers in the code to delimit the sections which
>      contain the offending code, and also eliminate bugs in MatAssembly
>      as a possible source of problems.
>   3) Do you have any MPI code you wrote yourself in here?
>
>     Matt
>
>> Dominik
>
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
>   -- Norbert Wiener
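To make the suggestion concrete, here is a sketch (not verbatim from the thread) of the barrier-bracketed assembly Matt describes, wrapped in a hypothetical helper AssembleWithBarrier. The point of using ((PetscObject)A)->comm rather than PETSC_COMM_WORLD is that it barriers exactly the ranks of the communicator A was created on; if A lives on a subcommunicator, a barrier on PETSC_COMM_WORLD would wait on ranks that never reach this code and could itself hang. When A was created on PETSC_COMM_WORLD the two are equivalent.

```c
#include <petscmat.h>

/* Sketch, assuming A may live on a subcommunicator of PETSC_COMM_WORLD.
 * AssembleWithBarrier is a hypothetical name, not a PETSc routine. */
PetscErrorCode AssembleWithBarrier(Mat A)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  CHKMEMQ;                              /* flag heap corruption before assembly */
  MPI_Barrier(((PetscObject)A)->comm);  /* only the ranks that share A must arrive here */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```

If this hangs, attaching a debugger on each rank shows whether the stall is at the barrier (a rank never arrived, i.e. the mismatch is earlier in your own code) or inside MatAssemblyBegin/End.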
