Dear awesome community,

Over the last few months, I have closely followed the evolution of bug 2043,
entitled 'sm BTL hang with GCC 4.4.x'.

https://svn.open-mpi.org/trac/ompi/ticket/2043

The reason is that I am developing MPI-based software, and I use
Open-MPI because it is the only implementation I am aware of that sends
messages eagerly (a powerful feature, that is).

http://denovoassembler.sourceforge.net/

I believe that this very pesky bug remains in Open-MPI 1.4.3, and
attached to this message is the evidence for my claim, or at
least I think it is ;).


Each byte transfer layer (BTL) has its own default limit for sending a
message eagerly. With shared memory (sm), the value is 4096 bytes, at
least according to ompi_info.
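
To make the notion of an eager limit concrete, here is a minimal sketch of
the size check a transfer layer would apply to each message. It is not
Open-MPI code: fitsEagerLimit and headerBytes are hypothetical names, and
whether any internal header is charged against the 4096-byte limit is an
assumption exposed as a parameter, not something established in this message.

```cpp
#include <cstddef>

// Hypothetical helper, not part of Open-MPI.
// Returns true when a message of payloadBytes is small enough to be sent
// eagerly under a limit of eagerLimit bytes.
// headerBytes is an ASSUMED per-message overhead charged against the limit;
// pass 0 if the limit applies to the payload alone.
bool fitsEagerLimit(std::size_t payloadBytes,
                    std::size_t headerBytes,
                    std::size_t eagerLimit){
	return payloadBytes+headerBytes<=eagerLimit;
}
```

With a limit of 4096, fitsEagerLimit(4096,0,4096) is true, while any nonzero
headerBytes makes the same 4096-byte payload fall outside the limit. A 1-byte
payload stays within it for any headerBytes up to 4095.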


To verify this limit, I implemented a very simple test. The source code
is test4096.cpp, which just sends a single message of 4096
bytes from one rank to another (rank 1 to 0).

The test was conclusive: the limit is 4096 bytes (see
mpirun-np-2-Simple.txt).



Then, I implemented a simple program (103 lines) that makes Open-MPI
1.4.3 hang. The code is in make-it-hang.cpp. At each iteration, each
rank sends a message to a randomly selected destination. A rank polls for
new messages with MPI_Iprobe. Each rank prints the current time once per
second for 30 seconds. Using this simple code, I ran 4 test cases:
two hang and two run fine (use the Makefile if you want to reproduce
the bug).
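
For anyone adapting the reproducer: make-it-hang.cpp draws the destination
with rand()%getSize(), which can also return the sender's own rank. Whether
self-sends play any role in the hang is not established here. The sketch
below is a hypothetical variant (pickOtherRank is not in make-it-hang.cpp)
that maps a random draw to a destination different from the sender, assuming
at least 2 ranks.

```cpp
#include <cassert>

// Hypothetical helper, not part of make-it-hang.cpp.
// Maps a non-negative random draw to a rank in [0,size) other than self.
// ASSUMES size>=2, 0<=self<size and draw>=0 (rand() satisfies the latter).
int pickOtherRank(int self,int size,int draw){
	assert(size>=2&&self>=0&&self<size&&draw>=0);
	int offset=1+draw%(size-1); // offset is in [1,size-1]
	return (self+offset)%size;  // a nonzero offset below size never maps back to self
}
```

In sendMessages, the draw would then read
int destination=pickOtherRank(getRank(),getSize(),rand());
which makes it possible to rule self-sends in or out as a factor.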

Before I describe these cases, I will describe the testing hardware. 

I use a computer with 32 x86_64 cores (see cat-proc-cpuinfo.txt.gz). 
The computer has 128 GB of physical memory (see
cat-proc-meminfo.txt.gz).
It runs Fedora Core 11 with Linux 2.6.30.10-105.2.23.fc11.x86_64 (see
dmesg.txt.gz & uname.txt).
Default kernel parameters are used at runtime (see
sudo-sysctl-a.txt.gz).

The C++ compiler is g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2) (see
g++--version.txt).


I compiled Open-MPI 1.4.3 myself (see config.out.gz, make.out.gz,
make-install.out.gz).
Finally, I use Open-MPI 1.4.3 with defaults (see ompi_info.txt.gz).




Now I can describe the cases.


Case 1: 30 MPI ranks, message size is 4096 bytes

File: mpirun-np-30-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.




Case 2: 30 MPI ranks, message size is 1 byte

File: mpirun-np-30-Program-1.txt.gz
Outcome: It runs just fine.




Case 3: 2 MPI ranks, message size is 4096 bytes

File: mpirun-np-2-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.




Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
disabled

File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
Outcome: It runs just fine.





A backtrace of the processes in Case 1 is in gdb-bt.txt.gz.




Thank you!

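Attachment: test4096.cpp

// test4096.cpp: rank 1 sends a single 4096-byte message to rank 0.
// Run with at least 2 ranks (see the 'simple' target in the Makefile).
#include<mpi.h>
#include<iostream>
using namespace std;

int main(int argc,char**argv){
	int rank;
	int size;
	MPI_Init(&argc,&argv);
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);
	MPI_Comm_size(MPI_COMM_WORLD,&size);
	cout<<"Rank "<<rank<<" welcomes you."<<endl;
	if(rank==0){
		// blocking receive of the 4096-byte message from rank 1
		char incoming[4096];
		MPI_Status status;
		MPI_Recv(incoming,4096,MPI_BYTE,1,0,MPI_COMM_WORLD,&status);
	}else if(rank==1){
		// the payload content is irrelevant, so the buffer is left uninitialized
		char data[4096];
		MPI_Send(data,4096,MPI_BYTE,0,0,MPI_COMM_WORLD);
	}

	cout<<"Rank "<<rank<<" thanks you."<<endl;
	MPI_Finalize();
	return 0;
}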

Attachment: mpirun-np-2-Simple.txt

Rank 1 welcomes you.
Rank 0 welcomes you.
Rank 1 thanks you.
Rank 0 thanks you.

Attachment: make-it-hang.cpp

/*
 * Author: Sébastien Boisvert
 * Université Laval
 *
 * sample code to make openmpi-1.4.3 hang
 *
 * excluding the shared memory solves the problem.
 *
 * see Makefile
 *
 *
 */

#include<mpi.h>
#include<stdlib.h>
#include<time.h>
#include<stdio.h>
#include<stdint.h>
#include<iostream>
using namespace std;

class Rank{
	int m_rank;
	int m_size;
	time_t m_startingPoint;
	int m_messageSize;
	void run();
	void receiveMessages();
	void sendMessages();
	int getRank();
	int getSize();
	bool isAlive();
public:
	Rank(int argc,char**argv);

};

int Rank::getSize(){
	return m_size;
}

bool Rank::isAlive(){
	int duration=30;
	return time(NULL)-m_startingPoint<duration;
}

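void Rank::receiveMessages(){
	int flag;
	MPI_Status status;
	// non-blocking check for a pending message from any rank
	MPI_Iprobe(MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&flag,&status);
	while(flag){
		// receive exactly the probed message, then probe again
		int length;
		MPI_Get_count(&status,MPI_BYTE,&length);
		uint8_t incoming[4096];
		MPI_Status status2;
		MPI_Recv(incoming,length,MPI_BYTE,status.MPI_SOURCE,status.MPI_TAG,MPI_COMM_WORLD,&status2);
		MPI_Iprobe(MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&flag,&status);
	}
}

void Rank::sendMessages(){
	// the destination is drawn from all ranks, so it can be this rank too
	int destination=rand()%getSize();
	// the payload content is irrelevant, so the buffer is left uninitialized
	uint8_t data[4096];
	// blocking standard-mode send of m_messageSize bytes
	MPI_Send(data,m_messageSize,MPI_BYTE,destination,0,MPI_COMM_WORLD);
}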

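Rank::Rank(int argc,char**argv){
	m_startingPoint=time(NULL);
	srand((unsigned)time(NULL));
	MPI_Init(&argc,&argv);
	// the message size is the first argument; it must fit in the 4096-byte buffers
	if(argc<2||atoi(argv[1])<1||atoi(argv[1])>4096){
		cerr<<"Usage: "<<argv[0]<<" <message size in bytes, 1 to 4096>"<<endl;
		MPI_Abort(MPI_COMM_WORLD,EXIT_FAILURE);
	}
	m_messageSize=atoi(argv[1]);
	MPI_Comm_rank(MPI_COMM_WORLD,&m_rank);
	MPI_Comm_size(MPI_COMM_WORLD,&m_size);
	MPI_Barrier(MPI_COMM_WORLD);
	run();
	MPI_Barrier(MPI_COMM_WORLD);
	MPI_Finalize();
}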

void Rank::run(){
	cout<<"Rank "<<getRank()<<" is running, message size is "<<m_messageSize<<endl;
	time_t last=time(NULL);
	while(isAlive()){
		receiveMessages(); 
		sendMessages();
		time_t theTime=time(NULL);
		if(theTime!=last){
			cout<<"Rank "<<getRank()<<": "<<theTime<<" seconds since Unix epoch"<<endl;
			last=theTime;
		}
	}
	cout<<"Rank "<<getRank()<<" has finished, Thank you for your assistance."<<endl;
}

int Rank::getRank(){
	return m_rank;
}

int main(int argc,char**argv){
	Rank(argc,argv);
	return EXIT_SUCCESS;
}

Attachment: Makefile

Program: make-it-hang.cpp
	mpic++ make-it-hang.cpp -g -o Program

vanilla1-30: Program
	mpirun -np 30 ./Program 1 |& tee mpirun-np-30-Program-1.txt

vanilla4096-30: Program
	mpirun -np 30 ./Program 4096 |& tee mpirun-np-30-Program-4096.txt

vanilla4096-2: Program
	mpirun -np 2 ./Program 4096 |& tee mpirun-np-2-Program-4096.txt

no-sm4096-30: Program
	mpirun --mca btl ^sm -np 30 ./Program 4096 |& tee mpirun-mca-btl-^sm-np-30-Program-4096.txt

simple: Simple
	mpirun -np 2 ./Simple |& tee mpirun-np-2-Simple.txt

Simple: test4096.cpp
	mpic++ test4096.cpp -O3 -o Simple

Attachment: cat-proc-cpuinfo.txt.gz
Description: GNU Zip compressed data

Attachment: cat-proc-meminfo.txt.gz
Description: GNU Zip compressed data

Attachment: dmesg.txt.gz
Description: GNU Zip compressed data

Attachment: uname.txt

Linux ls30.genome.ulaval.ca 2.6.30.10-105.2.23.fc11.x86_64 #1 SMP Thu Feb 11 07:06:34 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

Attachment: sudo-sysctl-a.txt.gz
Description: GNU Zip compressed data

Attachment: g++--version.txt

g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2)
Copyright © 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Attachment: config.out.gz
Description: GNU Zip compressed data

Attachment: make.out.gz
Description: GNU Zip compressed data

Attachment: make-install.out.gz
Description: GNU Zip compressed data

Attachment: ompi_info.txt.gz
Description: GNU Zip compressed data

Attachment: mpirun-np-30-Program-4096.txt

Rank 0 is running, message size is 4096
Rank 4 is running, message size is 4096
Rank 8 is running, message size is 4096
Rank 16 is running, message size is 4096
Rank 24 is running, message size is 4096
Rank 3 is running, message size is 4096
Rank 5 is running, message size is 4096
Rank 6 is running, message size is 4096
Rank 7 is running, message size is 4096
Rank 11 is running, message size is 4096
Rank 12 is running, message size is 4096
Rank 13 is running, message size is 4096
Rank 14 is running, message size is 4096
Rank 15 is running, message size is 4096
Rank 19 is running, message size is 4096
Rank 20 is running, message size is 4096
Rank 21 is running, message size is 4096
Rank 22 is running, message size is 4096
Rank 25 is running, message size is 4096
Rank 27 is running, message size is 4096
Rank 28 is running, message size is 4096
Rank 29 is running, message size is 4096
Rank 2 is running, message size is 4096
Rank 18 is running, message size is 4096
Rank 1 is running, message size is 4096
Rank 9 is running, message size is 4096
Rank 17 is running, message size is 4096
Rank 23 is running, message size is 4096
Rank 26 is running, message size is 4096
Rank 10 is running, message size is 4096

Attachment: mpirun-np-30-Program-1.txt.gz
Description: GNU Zip compressed data

Attachment: mpirun-np-2-Program-4096.txt

Rank 0 is running, message size is 4096
Rank 1 is running, message size is 4096
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 16624 on node ls30.genome.ulaval.ca exited on signal 15 (Terminated).
--------------------------------------------------------------------------

Attachment: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
Description: GNU Zip compressed data

Attachment: gdb-bt.txt.gz
Description: GNU Zip compressed data
