Hello,

On Tue, 2010-11-23 at 18:03 -0500, George Bosilca wrote: 
> If you know the max size of the receives I would take a different approach. 
> Post few persistent receives, and manage them in a circular buffer. Instead 
> of doing an MPI_Iprobe, use MPI_Test on the current head of your circular 
> buffer. Once you use the data related to the receive, just do an MPI_Start on 
> your request.
> 

I implemented your approach, and I must say IT IS FASTER !
My ring has 128 bins. I guess that qualifies as a 'few'.
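
In outline, here is the idea I implemented (a minimal sketch; the function names
are illustrative and I use MPI_BYTE here for simplicity, while the real code
attached below in MessagesHandler.cpp uses MPI_UNSIGNED_LONG_LONG and buffers
sized to the eager limit):

#include <mpi.h>

MPI_Request ring[128];
char buffers[128][4096];
int head=0;

// done once after MPI_Init: post all the persistent receives
void postReceives(){
	for(int i=0;i<128;i++){
		MPI_Recv_init(buffers[i],4096,MPI_BYTE,
			MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,ring+i);
		MPI_Start(ring+i);
	}
}

// called in the polling loop: only the head of the ring is tested
void poll(){
	int flag;
	MPI_Status status;
	MPI_Test(ring+head,&flag,&status);
	if(flag){
		// ... copy/consume buffers[head] using status ...
		MPI_Start(ring+head);	// re-arm the same persistent request
		head=(head+1)%128;
	}
}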


Here are my tests:
* Open-MPI 1.4.3
* Infiniband QDR/full bisection topology
* 32 MPI ranks (Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz)
* g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
* Ray 1.0.0-RC1
* colosse http://www.top500.org/system/10195


with MPI_Iprobe/MPI_Recv (old, r4023)


[sboisver12@colosse2 SRA001125]$ cat qsub-openmpi-r4023.sh
#!/bin/bash
#$ -N iprobe
#$ -P nne-790-aa
#$ -l h_rt=0:40:00
#$ -pe mpi 32
#$ -M sebastien.boisvert.3@
#$ -m bea
module load compilers/gcc/4.4.2 mpi/openmpi/1.4.3_gcc
/software/MPI/openmpi-1.4.3_gcc/bin/mpirun /home/sboisver12/r4023/code/Ray \
-p /home/sboisver12/nne-790-aa/SRA001125/SRR001665_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001665_2.fastq.gz \
-p /home/sboisver12/nne-790-aa/SRA001125/SRR001666_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001666_2.fastq.gz \
-o Ecoli-THEONE



 Beginning of computation: 1 seconds
 Distribution of sequence reads: 7 minutes, 48 seconds
 Distribution of vertices: 1 minutes, 36 seconds
 Calculation of coverage distribution: 0 seconds
 Distribution of edges: 2 minutes, 19 seconds
 Indexing of sequence reads: 5 seconds
 Computation of seeds: 3 minutes, 40 seconds
 Computation of library sizes: 1 minutes, 37 seconds
 Extension of seeds: 4 minutes, 41 seconds
 Computation of fusions: 1 minutes, 16 seconds
 Collection of fusions: 0 seconds
 Completion of the assembly: 23 minutes, 3 seconds




with MPI_Recv_init/MPI_Start (new, HEAD)




[sboisver12@colosse2 SRA001125]$ qsub qsub-openmpi-r4023.sh
Your job 1031990 ("iprobe") has been submitted
[sboisver12@colosse2 SRA001125]$ cat qsub-openmpi.sh
#!/bin/bash
#$ -N persistent
#$ -P nne-790-aa
#$ -l h_rt=0:40:00
#$ -pe mpi 32
#$ -M sebastien.boisvert.3@
#$ -m bea
module load compilers/gcc/4.4.2 mpi/openmpi/1.4.3_gcc
/software/MPI/openmpi-1.4.3_gcc/bin/mpirun /home/sboisver12/Ray/trunk/code/Ray \
-p /home/sboisver12/nne-790-aa/SRA001125/SRR001665_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001665_2.fastq.gz \
-p /home/sboisver12/nne-790-aa/SRA001125/SRR001666_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001666_2.fastq.gz \
-o Ecoli-THEONE


 Beginning of computation: 1 seconds
 Distribution of sequence reads: 7 minutes, 22 seconds
 Distribution of vertices: 1 minutes, 28 seconds
 Calculation of coverage distribution: 1 seconds
 Distribution of edges: 2 minutes, 14 seconds
 Indexing of sequence reads: 5 seconds
 Computation of seeds: 2 minutes, 41 seconds
 Computation of library sizes: 1 minutes, 14 seconds
 Extension of seeds: 3 minutes, 47 seconds
 Computation of fusions: 1 minutes, 0 seconds
 Collection of fusions: 1 seconds
 Completion of the assembly: 19 minutes, 54 seconds



So:

The MPI_Iprobe approach (old, r4023):

23 minutes, 3 seconds (1383 seconds)


The persistent approach proposed by George Bosilca (new, HEAD):

19 minutes, 54 seconds (1194 seconds)

That is 189 seconds saved, a reduction of about 14% in total wall-clock time.


> This approach will minimize the unexpected messages, and drain the 
> connections faster. Moreover, at the end it is very easy to MPI_Cancel all 
> the receives not yet matched.

I see.

>   george.

Thank you !

p.s.: I have learned a lot about MPI since my first post here !

> 
> On Nov 23, 2010, at 17:43 , Sébastien Boisvert wrote:
> 
> > On Tuesday, November 23, 2010, at 17:38 -0500, George Bosilca wrote:
> >> The eager size reported by ompi_info includes the Open MPI internal 
> >> headers. They are anywhere between 20 and 64 bytes long (potentially more 
> >> for some particular networks), so what Eugene suggested was a safe 
> >> boundary.
> > 
> > I see.
> > 
> >> 
> >> Moreover, eager send can improve performance if and only if the matching 
> >> receives are already posted on the peer. If not, the data will become 
> >> unexpected, and there will be one additional memcpy.
> > 
> > So it won't improve performance in my application (Ray,
> > http://denovoassembler.sf.net) because I use MPI_Iprobe to check for
> > incoming messages, which means any receive (MPI_Recv) is never posted
> > before any send (MPI_Isend).
> > 
> > Thanks, this thread is very informative for me !
> > 
> >> 
> >>  george.
> >> 
> >> On Nov 23, 2010, at 17:29 , Sébastien Boisvert wrote:
> >> 
> >>> On Tuesday, November 23, 2010, at 16:07 -0500, Eugene Loh wrote:
> >>>> Sébastien Boisvert wrote:
> >>>> 
> >>>>> Now I can describe the cases.
> >>>>> 
> >>>>> 
> >>>> The test cases can all be explained by the test requiring eager messages 
> >>>> (something that test4096.cpp does not require).
> >>>> 
> >>>>> Case 1: 30 MPI ranks, message size is 4096 bytes
> >>>>> 
> >>>>> File: mpirun-np-30-Program-4096.txt
> >>>>> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> >>>>> 
> >>>>> 
> >>>> 4096 is rendezvous.  For eager, try 4000 or lower.
> >>> 
> >>> According to ompi_info, the threshold is 4096, not 4000, right ?
> >>> 
> >>> (Open-MPI 1.4.3)
> >>> [sboisver12@colosse1 ~]$ ompi_info -a|less
> >>>                MCA btl: parameter "btl_sm_eager_limit" (current value:
> >>> "4096", data source: default value)
> >>>                         Maximum size (in bytes) of "short" messages
> >>> (must be >= 1).
> >>> 
> >>> 
> >>> "btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
> >>> that is, a sender attempts to write its entire message to shared buffers
> >>> without waiting for a receiver to be ready. Above this size, a sender
> >>> will only write the first part of a message, then wait for the receiver
> >>> to acknowledge its ready before continuing. Eager sends can improve
> >>> performance by decoupling senders from receivers."
> >>> 
> >>> 
> >>> 
> >>> source:
> >>> http://www.open-mpi.org/faq/?category=sm#more-sm
> >>> 
> >>> 
> >>> It should say "Below this size or equal to this size" instead of "Below
> >>> this size" as ompi_info says. ;)
> >>> 
> >>> 
> >>> 
> >>> 
> >>> As Mr. George Bosilca put it:
> >>> 
> >>> "__should__ is not correct, __might__ is a better verb to describe the
> >>> most "common" behavior for small messages. The problem comes from the
> >>> fact that in each communicator the FIFO ordering is required by the MPI
> >>> standard. As soon as there is any congestion, MPI_Send will block even
> >>> for small messages (and this independent on the underlying network)
> >>> until all the pending packets have been delivered."
> >>> 
> >>> source:
> >>> http://www.open-mpi.org/community/lists/devel/2010/11/8696.php
> >>> 
> >>> 
> >>> 
> >>>> 
> >>>>> Case 2: 30 MPI ranks, message size is 1 byte
> >>>>> 
> >>>>> File: mpirun-np-30-Program-1.txt.gz
> >>>>> Outcome: It runs just fine.
> >>>>> 
> >>>>> 
> >>>> 1 byte is eager.
> >>> 
> >>> I agree.
> >>> 
> >>>> 
> >>>>> Case 3: 2 MPI ranks, message size is 4096 bytes
> >>>>> 
> >>>>> File: mpirun-np-2-Program-4096.txt
> >>>>> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> >>>>> 
> >>>>> 
> >>>> Same as Case 1.
> >>>> 
> >>>>> Case 4: 30 MPI ranks, message size if 4096 bytes, shared memory is
> >>>>> disabled
> >>>>> 
> >>>>> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> >>>>> Outcome: It runs just fine.
> >>>>> 
> >>>>> 
> >>>> Eager limit for TCP is 65536 (perhaps less some overhead).  So, these 
> >>>> messages are eager.
> >>> 
> >>> I agree.
> >>> 
> >>>> 
> >>>> 
> >>> 
> >>> 
> >>> 
> >> 
> >> 
> > 
> > -- 
> > M. Sébastien Boisvert
> > Ph.D. student in physiology-endocrinology at Université Laval
> > Scholarship holder of the Canadian Institutes of Health Research
> > Team of Professor Jacques Corbeil
> > 
> > Centre de recherche en infectiologie de l'Université Laval
> > Room R-61B
> > 2705, boulevard Laurier
> > Québec, Québec
> > Canada G1V 4G2
> > Telephone: 418 525 4444 46342
> > 
> > E-mail: s...@boisvert.info
> > Web: http://boisvert.info
> > 
> > "Innovation comes only from an assault on the unknown" -Sydney Brenner
> > 
> 
> 



"Innovation comes only from an assault on the unknown" -Sydney Brenner

/*
 	Ray
    Copyright (C) 2010  Sébastien Boisvert

	http://DeNovoAssembler.SourceForge.Net/

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, version 3 of the License.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You have received a copy of the GNU General Public License
    along with this program (COPYING).  
	see <http://www.gnu.org/licenses/>

*/

#include<MessagesHandler.h>
#include<common_functions.h>
#include<assert.h>


/*
 * send messages,
 */
void MessagesHandler::sendMessages(StaticVector*outbox,int source){
	for(int i=0;i<(int)outbox->size();i++){
		Message*aMessage=((*outbox)[i]);
		#ifdef ASSERT
		int destination=aMessage->getDestination();
		assert(destination>=0);
		#endif

		MPI_Request request;
		// MPI_Isend: standard non-blocking send.
		// The request is freed right after the send is posted; MPI will still
		// complete the transfer in the background, but it is never waited on.
		#ifdef ASSERT
		assert(!(aMessage->getBuffer()==NULL && aMessage->getCount()>0));
		#endif
		#ifndef ASSERT
		MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		#else
		int value=MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		assert(value==MPI_SUCCESS);
		#endif

		MPI_Request_free(&request);

		#ifdef ASSERT
		assert(request==MPI_REQUEST_NULL);
		#endif
	}

	outbox->clear();
}



/*	
 * receiveMessages is implemented as recommended by Mr. George Bosilca from
the University of Tennessee (via the Open-MPI mailing list)

From: George Bosilca <bosilca@…>
Reply-to: Open MPI Developers <devel@…>
To: Open MPI Developers <devel@…>
Subject: Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
List-Post: devel@lists.open-mpi.org
Date: 2010-11-23 18:03:04

If you know the max size of the receives I would take a different approach. 
Post few persistent receives, and manage them in a circular buffer. 
Instead of doing an MPI_Iprobe, use MPI_Test on the current head of your circular buffer. 
Once you use the data related to the receive, just do an MPI_Start on your request.
This approach will minimize the unexpected messages, and drain the connections faster. 
Moreover, at the end it is very easy to MPI_Cancel all the receives not yet matched.

    george. 
 */

void MessagesHandler::receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination){
	int flag;
	MPI_Status status;
	MPI_Test(m_ring+m_head,&flag,&status);

	if(flag){
		// get the length of the message;
		// it is not necessarily the same as the count posted with MPI_Recv_init,
		// which was only an upper bound (the capacity of one ring slot)
		int tag=status.MPI_TAG;
		int source=status.MPI_SOURCE;
		int length;
		MPI_Get_count(&status,MPI_UNSIGNED_LONG_LONG,&length);
		u64*filledBuffer=(u64*)m_buffers+m_head*MPI_BTL_SM_EAGER_LIMIT/sizeof(u64);

		// copy it in a safe buffer
		u64*incoming=(u64*)inboxAllocator->allocate(length*sizeof(u64));
		for(int i=0;i<length;i++){
			incoming[i]=filledBuffer[i];
		}

		// the request can start again
		MPI_Start(m_ring+m_head);

		// add the message in the inbox
		Message aMessage(incoming,length,MPI_UNSIGNED_LONG_LONG,source,tag,source);
		inbox->push_back(aMessage);
		m_receivedMessages[source]++;

		// increment the head
		m_head++;
		if(m_head==m_ringSize){
			m_head=0;
		}
	}
}

void MessagesHandler::showStats(){
	cout<<"Rank "<<m_rank;
	for(int i=0;i<m_size;i++){
		cout<<" "<<m_receivedMessages[i];
	}
	cout<<endl;
}

void MessagesHandler::addCount(int rank,u64 count){
	m_allReceivedMessages[rank*m_size+m_allCounts[rank]]=count;
	m_allCounts[rank]++;
}

bool MessagesHandler::isFinished(int rank){
	return m_allCounts[rank]==m_size;
}

bool MessagesHandler::isFinished(){
	for(int i=0;i<m_size;i++){
		if(!isFinished(i)){
			return false;
		}
	}

	// copy the master's own received-message counts into its row of the table,
	// since the master never sends them to itself
	for(int i=0;i<m_size;i++){
		m_allReceivedMessages[MASTER_RANK*m_size+i]=m_receivedMessages[i];
	}

	return true;
}

void MessagesHandler::writeStats(const char*file){
	FILE*f=fopen(file,"w+");

	for(int i=0;i<m_size;i++){
		fprintf(f,"\t%i",i);
	}

	fprintf(f,"\n");

	for(int i=0;i<m_size;i++){
		fprintf(f,"%i",i);
		for(int j=0;j<m_size;j++){
			fprintf(f,"\t%lu",m_allReceivedMessages[i*m_size+j]);
		}
		fprintf(f,"\n");
	}
	fclose(f);
}

void MessagesHandler::constructor(int rank,int size){
	m_rank=rank;
	m_size=size;
	m_receivedMessages=(u64*)__Malloc(sizeof(u64)*m_size);
	if(rank==MASTER_RANK){
		m_allReceivedMessages=(u64*)__Malloc(sizeof(u64)*m_size*m_size);
		m_allCounts=(int*)__Malloc(sizeof(int)*m_size);
	}

	for(int i=0;i<m_size;i++){
		m_receivedMessages[i]=0;
		if(rank==MASTER_RANK){
			m_allCounts[i]=0;
		}
	}

	// the ring contains 128 elements.
	m_ringSize=128;
	m_ring=(MPI_Request*)__Malloc(sizeof(MPI_Request)*m_ringSize);
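	// one receive buffer per ring slot, sized to the shared-memory eager
	// limit, which is the largest message this code expects to receive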
	m_buffers=(char*)__Malloc(MPI_BTL_SM_EAGER_LIMIT*m_ringSize);
	m_head=0;

	// post a few receives.
	for(int i=0;i<m_ringSize;i++){
		void*buffer=m_buffers+i*MPI_BTL_SM_EAGER_LIMIT;
		MPI_Recv_init(buffer,MPI_BTL_SM_EAGER_LIMIT/sizeof(VERTEX_TYPE),MPI_UNSIGNED_LONG_LONG,
			MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,m_ring+i);
		MPI_Start(m_ring+i);
	}
}

u64*MessagesHandler::getReceivedMessages(){
	return m_receivedMessages;
}

void MessagesHandler::freeLeftovers(){
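	// cancel and free the persistent receives that were never matched,
	// as suggested in the quoted message above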
	for(int i=0;i<m_ringSize;i++){
		MPI_Cancel(m_ring+i);
		MPI_Request_free(m_ring+i);
	}
	__Free(m_ring);
	__Free(m_buffers);
}
/*
 	Ray
    Copyright (C) 2010  Sébastien Boisvert

	http://DeNovoAssembler.SourceForge.Net/

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, version 3 of the License.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You have received a copy of the GNU General Public License
    along with this program (COPYING).  
	see <http://www.gnu.org/licenses/>

*/

#ifndef _MessagesHandler
#define _MessagesHandler

#include<vector>
#include<MyAllocator.h>
#include<Message.h>
#include<common_functions.h>
#include<RingAllocator.h>
#include<StaticVector.h>
#include<PendingRequest.h>
using namespace std;


class MessagesHandler{
	int m_ringSize;
	int m_head;
	MPI_Request*m_ring;
	char*m_buffers;

	u64*m_receivedMessages;
	int m_rank;
	int m_size;

	u64*m_allReceivedMessages;
	int*m_allCounts;

public:
	void constructor(int rank,int size);
	void showStats();
	void sendMessages(StaticVector*outbox,int source);
	void receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination);
	u64*getReceivedMessages();
	void addCount(int rank,u64 count);
	void writeStats(const char*file);
	bool isFinished();
	bool isFinished(int rank);
	void freeLeftovers();
};

#endif
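
For reference, here is roughly how I drive this class (a sketch only: the
main() scaffolding, the 'done' flag and the initialization of StaticVector
and RingAllocator are simplified assumptions, not the actual Ray main loop):

#include <mpi.h>
#include <MessagesHandler.h>
#include <StaticVector.h>
#include <RingAllocator.h>

int main(int argc,char**argv){
	MPI_Init(&argc,&argv);
	int rank;
	int size;
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);
	MPI_Comm_size(MPI_COMM_WORLD,&size);

	MessagesHandler messagesHandler;
	messagesHandler.constructor(rank,size);	// posts the 128 persistent receives

	StaticVector inbox;
	StaticVector outbox;
	RingAllocator inboxAllocator;	// assumed to be set up as elsewhere in Ray

	bool done=false;
	while(!done){
		// MPI_Test the head of the ring, MPI_Start it again once consumed
		messagesHandler.receiveMessages(&inbox,&inboxAllocator,rank);

		// ... react to the inbox, append replies to the outbox,
		// and decide when the computation is over (sets done) ...

		// MPI_Isend everything that was queued, then clear the outbox
		messagesHandler.sendMessages(&outbox,rank);
	}

	messagesHandler.freeLeftovers();	// MPI_Cancel + MPI_Request_free the ring
	MPI_Finalize();
	return 0;
}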
