Hi guys, I reported this problem a few days ago but I still cannot get it fixed. Right now I am learning how to debug parallel code, and I just want to get some suggestions before I figure out how the debugger works.
This is just a big run of my own FEM code, which has almost the same structure as ex3 in the KSP examples. The code runs fine on one compute node with any number of processes (the largest dof count I have used is around 65,000), and smaller problems (fewer than 5,000 dofs) also run fine on more than one compute node. However, when I try to run a larger job (for example, 10,000 dofs) on two compute nodes, the code gets stuck at the MatAssemblyEnd() stage. Running with the -info option, I see that only some of the processes print the MatAssemblyEnd_SeqAIJ() information, and the code hangs there.

I have several questions:

1. In ex3, the comments say that the matrix is intentionally laid out across processors differently from the way it is assembled. As I understand it, this means that MatSetValues() will insert values into rows owned by other processors (am I correct?). Since generating entries on the "wrong" process is expensive, I am wondering whether there is a better way to do this, especially for assembling the global stiffness matrix in FEM. (In my code, each MatSetValues() call adds a 64-by-64 element stiffness matrix.)

2. Since my code (around 10,000 dofs) works fine on a single node but gets stuck on two nodes, could this be due to the large chunk of data that needs to be communicated between nodes during matrix assembly? Is data communication between different nodes slower than within a single node?

I appreciate any suggestions, and I will also keep working on the debugging.

Thanks,
Wen
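P.S. For context, the assembly pattern in my code roughly looks like the sketch below. This is only a simplified illustration, not my actual code: the 64 dofs per element and the 10,000 global dofs match my problem, but the preallocation numbers are guesses and get_element() is a placeholder for my own element routine.

    /* Simplified sketch of the FEM assembly pattern described above.
       Preallocation counts and get_element() are illustrative only. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      Mat            A;
      PetscInt       N = 10000;       /* global number of dofs (example size)   */
      PetscInt       nen = 64;        /* dofs per element stiffness matrix      */
      PetscInt       idx[64];         /* global dof indices of one element      */
      PetscScalar    Ke[64*64];       /* dense element stiffness matrix         */
      PetscInt       e, nelocal = 0;  /* number of elements this process builds */
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

      ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
      ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);CHKERRQ(ierr);
      ierr = MatSetFromOptions(A);CHKERRQ(ierr);
      /* Preallocation (guessed counts); without it assembly gets very slow,
         although it should not hang. */
      ierr = MatMPIAIJSetPreallocation(A, 81, NULL, 81, NULL);CHKERRQ(ierr);
      ierr = MatSeqAIJSetPreallocation(A, 81, NULL);CHKERRQ(ierr);

      /* Each process loops over the elements it generated; entries that belong
         to rows owned by other processes are stashed and sent during assembly. */
      for (e = 0; e < nelocal; e++) {
        /* get_element(e, idx, Ke);  placeholder for my own element routine */
        ierr = MatSetValues(A, nen, idx, nen, idx, Ke, ADD_VALUES);CHKERRQ(ierr);
      }

      /* These two calls are collective: every process must reach them, even one
         that owns no elements, otherwise the others hang in MatAssemblyEnd(). */
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return 0;
    }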
