Dear awesome community,
Over the last months, I have closely followed the evolution of bug 2043, entitled 'sm BTL hang with GCC 4.4.x'.

https://svn.open-mpi.org/trac/ompi/ticket/2043

The reason is that I am developing MPI-based software, and I use Open-MPI because it is the only implementation I am aware of that sends messages eagerly (a powerful feature, that is).

http://denovoassembler.sourceforge.net/

I believe that this very pesky bug remains in Open-MPI 1.4.3, and enclosed with this message is the evidence supporting my claim, or at least I think it is ;).

Each byte transfer layer (BTL) has its own default limit for sending a message eagerly. With shared memory (sm), the value is 4096 bytes, at least according to ompi_info.

To verify this limit, I implemented a very simple test. The source code is test4096.cpp, which simply sends a single 4096-byte message from one rank to another (rank 1 to rank 0). The test was conclusive: the limit is 4096 bytes (see mpirun-np-2-Simple.txt).

Then, I implemented a simple program (103 lines) that makes Open-MPI 1.4.3 hang. The code is in make-it-hang.cpp. At each iteration, each rank sends a message to a randomly selected destination. A rank polls for new messages with MPI_Iprobe. Each rank prints the current time once per second for 30 seconds.

Using this simple code, I ran 4 test cases, each with a different outcome (use the Makefile if you want to reproduce the bug).

Before I describe these cases, I will describe the testing hardware.

I use a computer with 32 x86_64 cores (see cat-proc-cpuinfo.txt.gz). The computer has 128 GB of physical memory (see cat-proc-meminfo.txt.gz). It runs Fedora Core 11 with Linux 2.6.30.10-105.2.23.fc11.x86_64 (see dmesg.txt.gz & uname.txt). Default kernel parameters are used at runtime (see sudo-sysctl-a.txt.gz).

The C++ compiler is g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2) (see g++--version.txt).

I compiled Open-MPI 1.4.3 myself (see config.out.gz, make.out.gz, make-install.out.gz). Finally, I use Open-MPI 1.4.3 with defaults (see ompi_info.txt.gz).

Now I can describe the cases.

Case 1: 30 MPI ranks, message size is 4096 bytes
File: mpirun-np-30-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.

Case 2: 30 MPI ranks, message size is 1 byte
File: mpirun-np-30-Program-1.txt.gz
Outcome: It runs just fine.

Case 3: 2 MPI ranks, message size is 4096 bytes
File: mpirun-np-2-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.

Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is disabled
File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
Outcome: It runs just fine.

A backtrace of the processes in Case 1 is in gdb-bt.txt.gz.

Thank you!
// test4096.cpp: send one 4096-byte message from rank 1 to rank 0
#include<mpi.h>
#include<iostream>
using namespace std;

int main(int argc,char**argv){
	int rank;
	int size;
	MPI_Init(&argc,&argv);
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);
	MPI_Comm_size(MPI_COMM_WORLD,&size);
	cout<<"Rank "<<rank<<" welcomes you."<<endl;
	if(rank==0){
		// rank 0 receives the message
		char incoming[4096];
		MPI_Status status;
		MPI_Recv(incoming,4096,MPI_BYTE,1,0,MPI_COMM_WORLD,&status);
	}else if(rank==1){
		// rank 1 sends the message (the payload content is irrelevant)
		char data[4096];
		MPI_Send(data,4096,MPI_BYTE,0,0,MPI_COMM_WORLD);
	}
	cout<<"Rank "<<rank<<" thanks you."<<endl;
	MPI_Finalize();
}
Rank 1 welcomes you.
Rank 0 welcomes you.
Rank 1 thanks you.
Rank 0 thanks you.
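A note on reading this output, with a small sketch. Because rank 0 posts its MPI_Recv right away, the MPI_Send in test4096.cpp completes whether the message is delivered eagerly or after a handshake with the receiver, so a clean run does not by itself pin down the limit. Whether a 4096-byte payload counts as eager may also depend on whether the limit includes any per-message header. The function below only models that comparison: the names sentEagerly and headerBytes are hypothetical, and what the sm BTL actually counts against its limit has not been verified here.

```cpp
#include <cassert>
#include <cstddef>

// Toy model, NOT Open MPI code: would a message go out eagerly?
// headerBytes is a hypothetical per-message overhead counted against the
// limit; pass 0 to model a limit that applies to the payload alone.
bool sentEagerly(std::size_t payloadBytes,
                 std::size_t eagerLimitBytes,
                 std::size_t headerBytes) {
    assert(eagerLimitBytes > 0);  // a zero limit would make the model meaningless
    return payloadBytes + headerBytes <= eagerLimitBytes;
}
```

With headerBytes set to 0, a 4096-byte payload fits under a 4096-byte limit; with any non-zero header it does not, while a 1-byte payload fits either way.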
/*
 * Author: Sébastien Boisvert
 * Université Laval
 *
 * sample code to make openmpi-1.4.3 hang
 *
 * excluding the shared memory solves the problem.
 *
 * see Makefile
 *
 *
 */

#include<mpi.h>
#include<stdlib.h>
#include<time.h>
#include<stdio.h>
#include<stdint.h>
#include<iostream>
using namespace std;

class Rank{
	int m_rank;
	int m_size;
	time_t m_startingPoint;
	int m_messageSize;
	void run();
	void receiveMessages();
	void sendMessages();
	int getRank();
	int getSize();
	bool isAlive();
public:
	Rank(int argc,char**argv);
};

int Rank::getSize(){
	return m_size;
}

// a rank stays alive for 30 seconds after its creation
bool Rank::isAlive(){
	int duration=30;
	return time(NULL)-m_startingPoint<duration;
}

// receive every message that MPI_Iprobe reports as pending
void Rank::receiveMessages(){
	int flag;
	MPI_Status status;
	MPI_Iprobe(MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&flag,&status);
	while(flag){
		int length;
		MPI_Get_count(&status,MPI_BYTE,&length);
		uint8_t incoming[4096];
		MPI_Status status2;
		MPI_Recv(incoming,length,MPI_BYTE,status.MPI_SOURCE,status.MPI_TAG,MPI_COMM_WORLD,&status2);
		MPI_Iprobe(MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&flag,&status);
	}
}

// send one message to a randomly-selected destination (possibly this rank itself)
void Rank::sendMessages(){
	int destination=rand()%getSize();
	uint8_t data[4096];
	MPI_Send(data,m_messageSize,MPI_BYTE,destination,0,MPI_COMM_WORLD);
}

Rank::Rank(int argc,char**argv){
	m_startingPoint=time(NULL);
	srand((unsigned)time(NULL));
	MPI_Init(&argc,&argv);
	// the message size in bytes is the first command-line argument
	m_messageSize=atoi(argv[1]);
	MPI_Comm_rank(MPI_COMM_WORLD,&m_rank);
	MPI_Comm_size(MPI_COMM_WORLD,&m_size);
	MPI_Barrier(MPI_COMM_WORLD);
	run();
	MPI_Barrier(MPI_COMM_WORLD);
	MPI_Finalize();
}

void Rank::run(){
	cout<<"Rank "<<getRank()<<" is running, message size is "<<m_messageSize<<endl;
	time_t last=time(NULL);
	while(isAlive()){
		receiveMessages();
		sendMessages();
		// print the current time once per second
		time_t theTime=time(NULL);
		if(theTime!=last){
			cout<<"Rank "<<getRank()<<": "<<theTime<<" seconds since Unix epoch"<<endl;
			last=theTime;
		}
	}
	cout<<"Rank "<<getRank()<<" has finished, Thank you for your assistance."<<endl;
}

int Rank::getRank(){
	return m_rank;
}

int main(int argc,char**argv){
	Rank(argc,argv);
	return EXIT_SUCCESS;
}
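The receive-then-send loop above can be explored without MPI through a toy model. The MPI standard permits MPI_Send to block until a matching receive is posted, so one thing worth ruling out is that every rank ends up waiting inside its own send once messages stop going out eagerly. The sketch below is only a model under that assumption, not a description of Open MPI's behaviour: simulateHang, its parameters, and the rule that a non-eager send completes only when its destination next polls for messages are all hypothetical.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Toy model, NOT Open MPI internals. Each simulated rank repeats the loop of
// make-it-hang.cpp: drain incoming messages, then do one blocking send.
// Assumption: a non-eager send completes only once its destination gets back
// to receiving, and a rank stuck inside a send never receives.
// Returns true if every rank is stuck before maxRounds rounds have elapsed.
bool simulateHang(int ranks, bool eager, unsigned seed, int maxRounds) {
    assert(ranks > 0);
    const int kFree = -1;                      // rank is not waiting on anyone
    std::vector<int> waitingOn(ranks, kFree);  // destination each rank waits on
    std::mt19937 rng(seed);
    for (int round = 0; round < maxRounds; ++round) {
        bool anyoneRan = false;
        for (int r = 0; r < ranks; ++r) {
            if (waitingOn[r] != kFree) continue;  // still stuck inside its send
            anyoneRan = true;
            // receiveMessages(): every sender waiting on r is released.
            for (int s = 0; s < ranks; ++s) {
                if (waitingOn[s] == r) waitingOn[s] = kFree;
            }
            // sendMessages(): random destination, possibly r itself.
            int dest = static_cast<int>(rng() % static_cast<unsigned>(ranks));
            if (!eager) waitingOn[r] = dest;      // blocks until dest receives
        }
        if (!anyoneRan) return true;              // no rank can make progress
    }
    return false;
}
```

In this model, simulateHang(30, false, 1, 100000) corresponds to the setup of Case 1 and simulateHang(30, true, 1, 1000) to Case 2: with eager sends no rank ever waits, so the model never reports a hang. If the real program behaved like the non-eager branch at 4096 bytes, the hang would come from the blocking sends in the test itself; the model cannot tell whether that is what happens inside the sm BTL.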
Program: make-it-hang.cpp
	mpic++ make-it-hang.cpp -g -o Program

vanilla1-30: Program
	mpirun -np 30 ./Program 1 |& tee mpirun-np-30-Program-1.txt

vanilla4096-30: Program
	mpirun -np 30 ./Program 4096 |& tee mpirun-np-30-Program-4096.txt

vanilla4096-2: Program
	mpirun -np 2 ./Program 4096 |& tee mpirun-np-2-Program-4096.txt

no-sm4096-30: Program
	mpirun --mca btl ^sm -np 30 ./Program 4096 |& tee mpirun-mca-btl-^sm-np-30-Program-4096.txt

simple: Simple
	mpirun -np 2 ./Simple |& tee mpirun-np-2-Simple.txt

Simple: test4096.cpp
	mpic++ test4096.cpp -O3 -o Simple
cat-proc-cpuinfo.txt.gz
Description: GNU Zip compressed data
cat-proc-meminfo.txt.gz
Description: GNU Zip compressed data
dmesg.txt.gz
Description: GNU Zip compressed data
Linux ls30.genome.ulaval.ca 2.6.30.10-105.2.23.fc11.x86_64 #1 SMP Thu Feb 11 07:06:34 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
sudo-sysctl-a.txt.gz
Description: GNU Zip compressed data
g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2)
Copyright © 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
config.out.gz
Description: GNU Zip compressed data
make.out.gz
Description: GNU Zip compressed data
make-install.out.gz
Description: GNU Zip compressed data
ompi_info.txt.gz
Description: GNU Zip compressed data
Rank 0 is running, message size is 4096
Rank 4 is running, message size is 4096
Rank 8 is running, message size is 4096
Rank 16 is running, message size is 4096
Rank 24 is running, message size is 4096
Rank 3 is running, message size is 4096
Rank 5 is running, message size is 4096
Rank 6 is running, message size is 4096
Rank 7 is running, message size is 4096
Rank 11 is running, message size is 4096
Rank 12 is running, message size is 4096
Rank 13 is running, message size is 4096
Rank 14 is running, message size is 4096
Rank 15 is running, message size is 4096
Rank 19 is running, message size is 4096
Rank 20 is running, message size is 4096
Rank 21 is running, message size is 4096
Rank 22 is running, message size is 4096
Rank 25 is running, message size is 4096
Rank 27 is running, message size is 4096
Rank 28 is running, message size is 4096
Rank 29 is running, message size is 4096
Rank 2 is running, message size is 4096
Rank 18 is running, message size is 4096
Rank 1 is running, message size is 4096
Rank 9 is running, message size is 4096
Rank 17 is running, message size is 4096
Rank 23 is running, message size is 4096
Rank 26 is running, message size is 4096
Rank 10 is running, message size is 4096
mpirun-np-30-Program-1.txt.gz
Description: GNU Zip compressed data
Rank 0 is running, message size is 4096
Rank 1 is running, message size is 4096
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 16624 on node ls30.genome.ulaval.ca exited on signal 15 (Terminated).
--------------------------------------------------------------------------
mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
Description: GNU Zip compressed data
gdb-bt.txt.gz
Description: GNU Zip compressed data