Ben,

This conversation is getting long and winding, and we are getting into your cluster administration - which is not PETSc related.
I suggest you find out about using the cluster from your system admin, and how to use bsub: http://www.vub.ac.be/BFUCC/LSF/man/bsub.1.html

However, I'll point out the following things:

- I suggest learning how to schedule an interactive job on your cluster. This will help you with running multiple jobs on the same machine.

- When making comparisons, keep the differences between the runs you compare to a minimum.

  * For e.g.: you are comparing runs between different queues, '-q linux64' and '-q mcore_parallel'. There might be differences here that result in different performance.

  * If you are getting part of a machine [for -n 1 jobs] - verify whether you are sharing the other part with some other job. Without this verification your numbers are not meaningful. [Depending upon how the queue is configured - it can allocate either part of the node or the full node.]

  * You should be able to request 4 procs [i.e. 1 complete machine] but run either -np 1, 2 or 4 within that allocation. [This is easier to do in interactive mode.] This ensures nobody else is using the machine. And you can run your code multiple times - to see if you are getting consistent results.

Regarding the primary issue you've had - performance debugging your PETSc application in *SMP mode* - we've observed performance anomalies in your log_summary for both your code and ex2f.F. This could be due to one or more of the following:

- issues in your code
- issues with the MPI you are using
- issues with the cluster you are using

To narrow this down, the comparisons I suggest are:

- Compare my ex2f.F runs with the *exact* same runs on your machine. [You've claimed that you also have access to a 2-quad-core Intel Xeon X5355 machine:

>>>>>
> No of Nodes   Processors                   Qty per node   Total cores per node   Memory per node
> 4             Quad-Core Intel Xeon X5355   2              8                      16 GB
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
> 60            Dual-Core Intel Xeon 5160    2              4                      8 GB
<<<<<

  So you should be able to reproduce the exact same experiment as me - and compare the results. This keeps the software the same - and exposes differences in system software etc.]

  i.e. configure the latest mpich2 with the default compilers [gcc/gfortran]:

    ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

  Build PETSc with this MPI [and the same compilers]:

    ./config/configure.py --with-mpi-dir= --with-debugging=0

  And run ex2f.F 600x600 on 1, 2, 4, 8 procs on a *single* X5355 machine. [It might have a different queue name.]

- Now compare ex2f.F performance between MPICH [as built above] and the current MPI you are using. This should identify the performance differences between MPI implementations within the box [within the SMP box].

- Now compare runs between ex2f.F and your application.

At each of the above comparison steps - we are hoping to identify the reason for the differences and rectify it. Perhaps this is not possible on your cluster and you can't improve on what you already have.

If you can't debug the SMP performance issues, you can avoid SMP completely, and use 1 MPI task per machine [or 1 MPI task per memory bank => 2 per machine]. But you'll still have to do a similar analysis to make sure there are no performance anomalies in the tool chain [i.e. hardware, system software, MPI, application].

If you are willing to do the above steps, we can help with the comparisons.

As mentioned - this is getting long and windy. If you have further questions in this regard - we should continue at petsc-maint at mcs.anl.gov

Satish

On Sat, 19 Apr 2008, Ben Tay wrote:

> Hi Satish,
>
> First of all, I forgot to inform you that I've changed the m and n to 800. I would
> like to see if the larger value can make the scaling better. If required, I can
> redo the test with m,n=600.
>
> I can install MPICH but I don't think I can choose to run on a single machine
> using from 1 to 8 procs. In order to run the code, I usually have to use the
> command
>
> bsub -o log -q linux64 ./a.out for single procs
>
> bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out where $=no.
> of procs.
> for multiple procs.
>
> After that, when the job is running, I'll be given the server which my job
> runs on, e.g. atlas3-c10 (1 proc) or 2*atlas3-c10 + 2*atlas3-c12 (4 procs) or
> 2*atlas3-c10 + 2*atlas3-c12 + 2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told
> that 2*atlas3-c10 doesn't mean that it is running on a dual core single cpu.
>
> Btw, are you saying that I should first install the latest MPICH2 build with the
> option:
>
> ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker
>
> And then install PETSc with the MPICH2?
>
> So after that, do you know how to do what you've suggested on my servers? I
> don't really understand what you mean. Am I supposed to run 4 jobs on 1
> quadcore? Or 1 job using 4 cores on 1 quadcore? Well, I do know that
> atlas3-c00 to c03 are the location of the quad cores. I can force the use of
> them by
>
> bsub -o log -q mcore_parallel -n $ -m quadcore -a mvapich mpirun.lsf ./a.out
>
> Lastly, I made a mistake about the different times reported by the same compiler.
> Sorry about that.
>
> Thank you very much.
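[To make the suggested experiment concrete, here is a rough sketch of the whole sequence. This is only a sketch with assumptions baked in: the queue name 'mcore_parallel', the '-m quadcore' resource, the interactive 'bsub -Is' invocation, the MPICH2 version, and all directory paths are guesses - check them against your cluster's setup. Only 'bsub -o/-q/-n/-m' usage is taken from your own commands.]

```shell
# Sketch only - queue name, bsub flags, paths and versions are assumptions.

# 1) Grab one *full* quad-core machine interactively [8 cores, so nobody
#    else can share the node while you run 1, 2, 4 or 8 proc jobs on it]
bsub -Is -n 8 -m quadcore -q mcore_parallel $SHELL

# 2) Build MPICH2 with the default compilers [gcc/gfortran]
cd $HOME/soft/mpich2-src                 # assumed unpack location
./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker \
    --prefix=$HOME/soft/mpich2-install
make && make install

# 3) Build PETSc against this MPI, optimized [--with-debugging=0]
cd $HOME/soft/petsc                      # assumed PETSc source dir
./config/configure.py --with-mpi-dir=$HOME/soft/mpich2-install \
    --with-debugging=0
make

# 4) Run ex2f on 1, 2, 4, 8 procs *within the same allocation* at 600x600,
#    keeping a -log_summary per run for comparison
cd src/ksp/ksp/examples/tutorials
make ex2f
for np in 1 2 4 8; do
  $HOME/soft/mpich2-install/bin/mpiexec -np $np \
      ./ex2f -m 600 -n 600 -log_summary > log.$np
done
```

With gforker as the process manager, mpiexec forks all ranks on the local node - which is exactly what we want for the single-box SMP comparison.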