On Oct 24, 2011, at 3:37 PM, Wen Jiang wrote:

> Hi guys,
> 
> I reported this problem a few days ago but still have not been able to fix 
> it. Right now I am learning how to debug parallel code, and I would like to 
> get some suggestions before I figure out how the debugger works.
> 
> This is just a big run of my own FEM code, which has almost the same 
> structure as ex3 in the KSP examples. The code (the largest dof count I have 
> used is around 65,000) runs fine on one compute node with any number of 
> processes, and with a smaller dof count (less than 5,000) it also works fine 
> on more than one compute node. However, I run into a problem when I try to 
> run a larger job (for example, dof = 10,000) on two compute nodes.
> 
> The problem is that my code gets stuck at the MatAssemblyEnd() stage. I use 
> the option -info to print information about the run and find that only some 
> of the processes print the MatAssemblyEnd_SeqAIJ() information, and the code 
> hangs there.
> 
> I have several questions here,
> 
> 1. In ex3, the comments say that the matrix is intentionally laid out across 
> processors differently from the way it is assembled. As far as I understand, 
> this means that MatSetValues() will insert values on different processors 
> (am I correct?). Since generating the entries on the 'wrong' process is 
> expensive, I am wondering whether there is a better way to do it, especially 
> for assembling the global stiffness matrix in FEM. (In my code, each call to 
> MatSetValues() adds a 64-by-64 element stiffness matrix.)
> 
> 2. Since my code (dofs around 10,000) works fine on a single node but gets 
> stuck on two nodes, I am guessing that might be due to the large chunk of 
> data that needs to be communicated between nodes during matrix assembly. 
> Will the data communication be slower between different nodes than within a 
> single node?

   Absolutely. You want to generate most of the matrix entries on the process 
where they will be stored. You also need to make sure you have done the correct 
matrix preallocation: 
http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#efficient-assembly 
Working for a small problem and taking "forever" for a larger problem is a sign 
of bad preallocation or of too much data being computed on the wrong process.
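
   As a rough sketch (untested; ne_local, the made-up nonzero estimates, and 
the element routines marked "your code" below are placeholders for your own 
mesh data), owning-process assembly of a preallocated parallel AIJ matrix 
looks something like this:

  #include <petscmat.h>

  #define NDOF_PER_ELEM 64   /* dofs per element, as in your 64x64 blocks */

  int main(int argc, char **argv)
  {
    Mat            A;
    PetscInt       nglobal = 10000;                 /* total dofs */
    PetscInt       e, ne_local = 0;                 /* locally owned elements (placeholder) */
    PetscInt       idx[NDOF_PER_ELEM];              /* global dof numbers of one element */
    PetscScalar    Ke[NDOF_PER_ELEM*NDOF_PER_ELEM]; /* element stiffness, row-major */
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, (char*)0, (char*)0); if (ierr) return ierr;

    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, nglobal, nglobal);CHKERRQ(ierr);
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);
    /* Made-up estimates: ~80 nonzeros per row in the diagonal block, ~20 in
       the off-diagonal block; exact d_nnz[]/o_nnz[] arrays computed from the
       mesh are better. */
    ierr = MatMPIAIJSetPreallocation(A, 80, PETSC_NULL, 20, PETSC_NULL);CHKERRQ(ierr);
    ierr = MatSeqAIJSetPreallocation(A, 100, PETSC_NULL);CHKERRQ(ierr);

    /* Loop only over elements this process owns, so almost all entries are
       generated on the process that stores them; the few shared-boundary rows
       are communicated automatically in MatAssemblyBegin/End(). */
    for (e = 0; e < ne_local; e++) {
      /* your code: fill idx[] with the element's global dof numbers
                    and Ke[] with the 64x64 element stiffness matrix */
      ierr = MatSetValues(A, NDOF_PER_ELEM, idx, NDOF_PER_ELEM, idx, Ke,
                          ADD_VALUES);CHKERRQ(ierr);
    }

    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }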

   Also run the smaller problem with valgrind 
http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind to make 
sure there are no memory corruption problems that slip by on the small mesh but 
cause problems on the large one.
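
   With MPI that is usually something like (the executable name and options 
are placeholders; adjust to your setup):

  mpiexec -n 2 valgrind --tool=memcheck -q --num-callers=20 \
          --log-file=valgrind.%p.log ./yourfemcode -your_options

and then look through the per-process log files for invalid reads or writes.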

    Also run the small problem and check for correct memory preallocation; if 
it is wrong for the small problem it will be wrong for the large one.
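
    A quick way to check this (a rough sketch; you can also just look through 
the -info output for the "mallocs during MatSetValues" lines) is to call 
MatGetInfo() right after MatAssemblyEnd():

  #include <petscmat.h>

  /* Report how many extra mallocs MatSetValues() needed, summed over all
     processes; a nonzero count means the preallocation was too small. */
  PetscErrorCode CheckPreallocation(Mat A)
  {
    MatInfo        info;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = MatGetInfo(A, MAT_GLOBAL_SUM, &info);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD,
                       "mallocs during MatSetValues(): %g  unneeded nonzeros: %g\n",
                       (double)info.mallocs, (double)info.nz_unneeded);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }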

  Barry

> 
> I appreciate any suggestions, and I will also keep working on the 
> debugging.
> 
> Thanks,
> Wen
