Andrej, a load average of 700 is very curious.
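For what it's worth, a quick way to see what that number is made of (a rough, untested sketch with plain procps commands, nothing Open MPI specific) is to count the runnable and uninterruptible threads on the node, since that is roughly what the Linux load average tracks:

# count runnable (R) and uninterruptible (D) threads on the busy node
ps -eLo state | grep -c '[RD]'
# and see which commands those threads belong to
ps -eLo state,comm | awk '$1=="R" || $1=="D"' | sort | uniq -c | sort -rn

A count in the hundreds or thousands would point at far more active threads than the 48 cores can serve.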
I guess you already made sure the load average is zero when the system is idle ...

Are you running a hybrid app (e.g. MPI + OpenMP)? One possible explanation is
that you are running 48 MPI tasks and each task spawns 48 OpenMP threads, and
that kills performance.

When you log into the compute node, you can simply do

ls -l /proc/<pid>/task

to list the tasks of a given rank; there should only be 2 or 3 threads (the
main thread plus MPI threads that should be idle 99% of the time).
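For example, something along these lines should show it at a glance (an
untested sketch; it assumes the ranks show up as mpitest.py in the process
list, so adjust the pgrep pattern to whatever your ranks actually run):

# count the threads of every rank running on this node
for pid in $(pgrep -f mpitest.py); do
  echo "$pid: $(ls /proc/$pid/task | wc -l) threads"
done

If each rank reports something close to 48 threads instead of 2 or 3, the node
is oversubscribed, and that alone can explain the kernel time and the 700+
load average.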
First, try

export OMP_NUM_THREADS=1
mpirun -np 48 a.out

If that still does not work, try

mpirun -np 48 a.sh

where a.sh is

#!/bin/sh
export OMP_NUM_THREADS=1
exec a.out "$@"

For MPI profiling, you can try IPM
https://www.nersc.gov/users/software/performance-and-debugging-tools/ipm/
(build the tool, relink your app with IPM, set the environment variable, and
you will get a summary of how much time is spent in MPI and how many times it
is called).

For I/O profiling, there are tools like darshan, but I am not familiar with
them. A simpler approach is to run both

iostat 1

and

nfsiostat 1

on the compute node and see how much I/O is performed.

I hope this helps.

Cheers,

Gilles

On Wed, Oct 21, 2015 at 12:08 PM, Andrej Prsa <[email protected]> wrote:
> Hi Gilles,
>
> Thanks for your reply!
>
>> by "running on the head node", shall i understand you mean
>> "running mpirun command *and* all mpi tasks on the head node" ?
>
> Precisely.
>
>> by "running on the compute node", shall i understand you mean
>> "running mpirun on the compute node *and* all mpi tasks on the *same*
>> compute node" ?
>
> What I meant to say is "running mpirun on the head node and all mpi
> tasks on a single worker node", i.e.
>
> mpirun --hostfile node1 -np 48 --byslot ./mpitest.py
>
> where node1 contains:
>
> node1 slots=48 max_slots=48
>
> I then ssh'd to node1 and ran htop to observe that the processes are
> all colored in red, i.e. kernel cycles. A curious observation is the
> load average, which shows 700+, i.e. ridiculous given that there are
> only 48 processors.
>
>> what i mean is that if your mpi program is very sensitive to
>> latency/bandwidth, TCP over a gigabit ethernet is way slower than
>> shared memory, and that can explain the 20x drop in performance.
>
> Is there a good way to profile this? I installed an evaluation version
> of vampirtrace, but I didn't yet get as far as producing a trace with
> mpirun.
>
>> ideally, i would recommend you first try using openmpi v1.8 or the
>> latest v1.10.1rc2.
>
> I tried 1.8.2 and needed to use --bind-to none to get the cores to
> even run (with Jeff Squyres' help), most likely because of this L3
> bios bug. I'll try 1.10.1rc2 tomorrow and report back.
>
>> btw, is your head node running on similar hardware to the compute
>> nodes (e.g. 4 sockets, 48 cores)?
>
> Yes, all nodes are identical, 4 sockets, 48 cores.
>
>> are you using a batch manager to run on compute nodes?
>> if yes, did you check all tasks are running on different cores?
>> (depending on your configuration and how you submit your job, all mpi
>> tasks can end up sharing the same core ...)
>
> I do have torque/maui installed and configured, but I disabled them
> for these tests and am running mpirun directly.
>
>> you reported a lot of time is spent in the kernel, is this time spent
>> for MPI communication?
>> (it could be for NFS access since you have no local disk on the
>> compute nodes)
>
> I'd *love* to be able to answer that question, but I don't know how to
> find that out. If you can point me in the right direction, I'd love to
> learn how to do it.
>
>> btw, how many MPI tasks are you running?
>
> I ran it on all 48 cores per node.
>
>> if your application can run on a small number of tasks, here is what
>> you can do from your head node:
>>
>> mpirun -np 1 a.out
>> mpirun -np 1 -host <compute> a.out
>> (runs a job with one task on the compute node)
>> is there a performance drop? if yes, when?
>>
>> try again with -np 2 (using one host)
>> is there a performance drop? if yes, when?
>>
>> and then try with two nodes and one task per node:
>> mpirun -np 2 -host <compute1,compute2> a.out
>
> It'll take a bit of code modification, but let me try it and report
> back.
>
> Thanks a lot,
> Andrej
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18224.php
