Hello,

On Tue, 2010-11-23 at 18:03 -0500, George Bosilca wrote:
> If you know the max size of the receives I would take a different approach.
> Post few persistent receives, and manage them in a circular buffer. Instead
> of doing an MPI_Iprobe, use MPI_Test on the current head of your circular
> buffer. Once you use the data related to the receive, just do an MPI_Start on
> your request.
>
I implemented your approach, and I must say it IS FASTER! My ring has 128 bins; I guess that qualifies as "a few".

Here are my tests:

* Open MPI 1.4.3
* InfiniBand QDR, full-bisection topology
* 32 MPI ranks (Intel(R) Xeon(R) CPU X5560 @ 2.80GHz)
* g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
* Ray 1.0.0-RC1
* colosse http://www.top500.org/system/10195

With MPI_Iprobe/MPI_Recv (old, r4023):

[sboisver12@colosse2 SRA001125]$ cat qsub-openmpi-r4023.sh
#!/bin/bash
#$ -N iprobe
#$ -P nne-790-aa
#$ -l h_rt=0:40:00
#$ -pe mpi 32
#$ -M sebastien.boisvert.3@
#$ -m bea
module load compilers/gcc/4.4.2 mpi/openmpi/1.4.3_gcc
/software/MPI/openmpi-1.4.3_gcc/bin/mpirun /home/sboisver12/r4023/code/Ray \
 -p /home/sboisver12/nne-790-aa/SRA001125/SRR001665_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001665_2.fastq.gz \
 -p /home/sboisver12/nne-790-aa/SRA001125/SRR001666_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001666_2.fastq.gz \
 -o Ecoli-THEONE

Beginning of computation: 1 seconds
Distribution of sequence reads: 7 minutes, 48 seconds
Distribution of vertices: 1 minutes, 36 seconds
Calculation of coverage distribution: 0 seconds
Distribution of edges: 2 minutes, 19 seconds
Indexing of sequence reads: 5 seconds
Computation of seeds: 3 minutes, 40 seconds
Computation of library sizes: 1 minutes, 37 seconds
Extension of seeds: 4 minutes, 41 seconds
Computation of fusions: 1 minutes, 16 seconds
Collection of fusions: 0 seconds
Completion of the assembly: 23 minutes, 3 seconds

With MPI_Recv_init/MPI_Start (new, HEAD):

[sboisver12@colosse2 SRA001125]$ qsub qsub-openmpi-r4023.sh
Your job 1031990 ("iprobe") has been submitted
[sboisver12@colosse2 SRA001125]$ cat qsub-openmpi.sh
#!/bin/bash
#$ -N persistent
#$ -P nne-790-aa
#$ -l h_rt=0:40:00
#$ -pe mpi 32
#$ -M sebastien.boisvert.3@
#$ -m bea
module load compilers/gcc/4.4.2 mpi/openmpi/1.4.3_gcc
/software/MPI/openmpi-1.4.3_gcc/bin/mpirun /home/sboisver12/Ray/trunk/code/Ray \
 -p /home/sboisver12/nne-790-aa/SRA001125/SRR001665_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001665_2.fastq.gz \
 -p /home/sboisver12/nne-790-aa/SRA001125/SRR001666_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001666_2.fastq.gz \
 -o Ecoli-THEONE

Beginning of computation: 1 seconds
Distribution of sequence reads: 7 minutes, 22 seconds
Distribution of vertices: 1 minutes, 28 seconds
Calculation of coverage distribution: 1 seconds
Distribution of edges: 2 minutes, 14 seconds
Indexing of sequence reads: 5 seconds
Computation of seeds: 2 minutes, 41 seconds
Computation of library sizes: 1 minutes, 14 seconds
Extension of seeds: 3 minutes, 47 seconds
Computation of fusions: 1 minutes, 0 seconds
Collection of fusions: 1 seconds
Completion of the assembly: 19 minutes, 54 seconds

So:

The MPI_Iprobe approach: 23 minutes, 3 seconds
The persistent approach proposed by George Bosilca: 19 minutes, 54 seconds

> This approach will minimize the unexpected messages, and drain the
> connections faster. Moreover, at the end it is very easy to MPI_Cancel all
> the receives not yet matched.

I see.

> george.

Thank you!

P.S.: I have learned a lot about MPI since my first post here! (A standalone sketch of the persistent-receive pattern is included after the quoted thread below.)

>
> On Nov 23, 2010, at 17:43 , Sébastien Boisvert wrote:
>
> > On Tuesday, November 23, 2010 at 17:38 -0500, George Bosilca wrote:
> >> The eager size reported by ompi_info includes the Open MPI internal
> >> headers. They are anywhere between 20 and 64 bytes long (potentially more
> >> for some particular networks), so what Eugene suggested was a safe
> >> boundary.
> >
> > I see.
> >
> >>
> >> Moreover, eager send can improve performance if and only if the matching
> >> receives are already posted on the peer. If not, the data will become
> >> unexpected, and there will be one additional memcpy.
> >
> > So it won't improve performance in my application (Ray,
> > http://denovoassembler.sf.net) because I use MPI_Iprobe to check for
> > incoming messages, which means any receive (MPI_Recv) is never posted
> > before any send (MPI_Isend).
> >
> > Thanks, this thread is very informative for me!
> >
> >>
> >> george.
> >>
> >> On Nov 23, 2010, at 17:29 , Sébastien Boisvert wrote:
> >>
> >>> On Tuesday, November 23, 2010 at 16:07 -0500, Eugene Loh wrote:
> >>>> Sébastien Boisvert wrote:
> >>>>
> >>>>> Now I can describe the cases.
> >>>>>
> >>>> The test cases can all be explained by the test requiring eager messages
> >>>> (something that test4096.cpp does not require).
> >>>>
> >>>>> Case 1: 30 MPI ranks, message size is 4096 bytes
> >>>>>
> >>>>> File: mpirun-np-30-Program-4096.txt
> >>>>> Outcome: it hangs -- I killed the poor thing after 30 seconds or so.
> >>>>>
> >>>> 4096 is rendezvous. For eager, try 4000 or lower.
> >>>
> >>> According to ompi_info, the threshold is 4096, not 4000, right?
> >>>
> >>> (Open MPI 1.4.3)
> >>> [sboisver12@colosse1 ~]$ ompi_info -a | less
> >>>   MCA btl: parameter "btl_sm_eager_limit" (current value: "4096",
> >>>            data source: default value)
> >>>            Maximum size (in bytes) of "short" messages (must be >= 1).
> >>>
> >>> "btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
> >>> that is, a sender attempts to write its entire message to shared buffers
> >>> without waiting for a receiver to be ready. Above this size, a sender
> >>> will only write the first part of a message, then wait for the receiver
> >>> to acknowledge its ready before continuing. Eager sends can improve
> >>> performance by decoupling senders from receivers."
> >>>
> >>> source:
> >>> http://www.open-mpi.org/faq/?category=sm#more-sm
> >>>
> >>> It should say "Below this size or equal to this size" instead of "Below
> >>> this size", as ompi_info says. ;)
> >>>
> >>> As Mr. George Bosilca put it:
> >>>
> >>> "__should__ is not correct, __might__ is a better verb to describe the
> >>> most "common" behavior for small messages. The problem comes from the
> >>> fact that in each communicator the FIFO ordering is required by the MPI
> >>> standard. As soon as there is any congestion, MPI_Send will block even
> >>> for small messages (and this independent on the underlying network)
> >>> until all the pending packets have been delivered."
> >>>
> >>> source:
> >>> http://www.open-mpi.org/community/lists/devel/2010/11/8696.php
> >>>
> >>>>
> >>>>> Case 2: 30 MPI ranks, message size is 1 byte
> >>>>>
> >>>>> File: mpirun-np-30-Program-1.txt.gz
> >>>>> Outcome: it runs just fine.
> >>>>>
> >>>> 1 byte is eager.
> >>>
> >>> I agree.
> >>>
> >>>>
> >>>>> Case 3: 2 MPI ranks, message size is 4096 bytes
> >>>>>
> >>>>> File: mpirun-np-2-Program-4096.txt
> >>>>> Outcome: it hangs -- I killed the poor thing after 30 seconds or so.
> >>>>>
> >>>> Same as Case 1.
> >>>>
> >>>>> Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
> >>>>> disabled
> >>>>>
> >>>>> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> >>>>> Outcome: it runs just fine.
> >>>>>
> >>>> The eager limit for TCP is 65536 (perhaps less some overhead), so these
> >>>> messages are eager.
> >>> I agree.

"Innovation comes only from an assault on the unknown" -Sydney Brenner
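For anyone who wants to try the pattern without digging through Ray's source, below is a minimal, self-contained sketch of the ring of persistent receives that George described. It is only an illustration: the ring size (16 slots), the slot capacity (512 ints) and the tag (42) are arbitrary values chosen for the example, not the ones Ray uses (Ray posts 128 slots of MPI_BTL_SM_EAGER_LIMIT bytes each, as the attached files show).

// ring.cpp -- minimal sketch of a ring of persistent receives.
// Every rank fires one small message at rank 0; rank 0 drains the
// messages by testing only the head of its ring and re-arming each
// receive once its data has been consumed.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc,char**argv){
	MPI_Init(&argc,&argv);
	int rank;
	int size;
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);
	MPI_Comm_size(MPI_COMM_WORLD,&size);

	const int RING_SIZE=16;   // "a few" persistent receives (illustrative value)
	const int MAX_COUNT=512;  // upper bound on the message size, in ints (illustrative value)

	std::vector<int> buffers(RING_SIZE*MAX_COUNT);
	std::vector<MPI_Request> ring(RING_SIZE);

	// post the persistent receives once
	for(int i=0;i<RING_SIZE;i++){
		MPI_Recv_init(&buffers[i*MAX_COUNT],MAX_COUNT,MPI_INT,
			MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&ring[i]);
		MPI_Start(&ring[i]);
	}

	// every rank sends one message to rank 0 (tag 42 is arbitrary)
	int payload=rank;
	MPI_Request sendRequest;
	MPI_Isend(&payload,1,MPI_INT,0,42,MPI_COMM_WORLD,&sendRequest);
	MPI_Request_free(&sendRequest);

	if(rank==0){
		int received=0;
		int head=0;
		while(received<size){
			int flag;
			MPI_Status status;
			// instead of MPI_Iprobe, test the current head of the circular buffer
			MPI_Test(&ring[head],&flag,&status);
			if(flag){
				int count;
				MPI_Get_count(&status,MPI_INT,&count);
				printf("got %d int(s) from rank %d\n",count,status.MPI_SOURCE);
				received++;
				// the data has been consumed, so the request can start again
				MPI_Start(&ring[head]);
				// advance the head of the circular buffer
				head=(head+1)%RING_SIZE;
			}
		}
	}

	MPI_Barrier(MPI_COMM_WORLD);

	// cancel and free the receives that were never matched
	for(int i=0;i<RING_SIZE;i++){
		MPI_Cancel(&ring[i]);
		MPI_Request_free(&ring[i]);
	}

	MPI_Finalize();
	return 0;
}

Compile it with mpicxx and run it with a handful of ranks; the interesting part is that rank 0 never probes, it only tests the head of the ring, exactly as receiveMessages() does in the attached MessagesHandler.cpp.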
/*
 	Ray
	Copyright (C) 2010  Sébastien Boisvert

	http://DeNovoAssembler.SourceForge.Net/

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU General Public License as published by
	the Free Software Foundation, version 3 of the License.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU General Public License for more details.

	You have received a copy of the GNU General Public License
	along with this program (COPYING).
	see <http://www.gnu.org/licenses/>
*/

#include<MessagesHandler.h>
#include<common_functions.h>
#include<assert.h>

/*
 * send the messages stored in the outbox
 */
void MessagesHandler::sendMessages(StaticVector*outbox,int source){
	for(int i=0;i<(int)outbox->size();i++){
		Message*aMessage=((*outbox)[i]);
		#ifdef ASSERT
		int destination=aMessage->getDestination();
		assert(destination>=0);
		#endif

		MPI_Request request;
		// MPI_Issend
		// Synchronous nonblocking.
		// Note that a Wait/Test will complete only when the matching receive is posted

		#ifdef ASSERT
		assert(!(aMessage->getBuffer()==NULL && aMessage->getCount()>0));
		#endif

		#ifndef ASSERT
		MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		#else
		int value=MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		assert(value==MPI_SUCCESS);
		#endif

		// free the request right away; the send buffer must remain valid
		// until MPI actually delivers the message
		MPI_Request_free(&request);

		#ifdef ASSERT
		assert(request==MPI_REQUEST_NULL);
		#endif
	}

	outbox->clear();
}

/*
 * receiveMessages is implemented as recommended by Mr. George Bosilca from
 * the University of Tennessee (via the Open-MPI mailing list):

	From: George Bosilca <bosilca@…>
	Reply-to: Open MPI Developers <devel@…>
	To: Open MPI Developers <devel@…>
	Subject: Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
	Date: 2010-11-23 18:03:04

	If you know the max size of the receives I would take a different approach.
	Post few persistent receives, and manage them in a circular buffer.
	Instead of doing an MPI_Iprobe, use MPI_Test on the current head of your
	circular buffer. Once you use the data related to the receive, just do an
	MPI_Start on your request.

	This approach will minimize the unexpected messages, and drain the
	connections faster. Moreover, at the end it is very easy to MPI_Cancel
	all the receives not yet matched.

	george.
*/
void MessagesHandler::receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination){
	int flag;
	MPI_Status status;
	// test only the head of the ring of persistent receives
	MPI_Test(m_ring+m_head,&flag,&status);

	if(flag){
		// get the length of the message
		// it is not necessarily the same as the count posted with MPI_Recv_init;
		// that one was an upper bound
		int tag=status.MPI_TAG;
		int source=status.MPI_SOURCE;
		int length;
		MPI_Get_count(&status,MPI_UNSIGNED_LONG_LONG,&length);
		u64*filledBuffer=(u64*)m_buffers+m_head*MPI_BTL_SM_EAGER_LIMIT/sizeof(u64);

		// copy it into a safe buffer
		u64*incoming=(u64*)inboxAllocator->allocate(length*sizeof(u64));
		for(int i=0;i<length;i++){
			incoming[i]=filledBuffer[i];
		}

		// the data has been consumed, so the request can start again
		MPI_Start(m_ring+m_head);

		// add the message to the inbox
		Message aMessage(incoming,length,MPI_UNSIGNED_LONG_LONG,source,tag,source);
		inbox->push_back(aMessage);
		m_receivedMessages[source]++;

		// advance the head of the ring
		m_head++;
		if(m_head==m_ringSize){
			m_head=0;
		}
	}
}

void MessagesHandler::showStats(){
	cout<<"Rank "<<m_rank;
	for(int i=0;i<m_size;i++){
		cout<<" "<<m_receivedMessages[i];
	}
	cout<<endl;
}

void MessagesHandler::addCount(int rank,u64 count){
	m_allReceivedMessages[rank*m_size+m_allCounts[rank]]=count;
	m_allCounts[rank]++;
}

bool MessagesHandler::isFinished(int rank){
	return m_allCounts[rank]==m_size;
}

bool MessagesHandler::isFinished(){
	for(int i=0;i<m_size;i++){
		if(!isFinished(i)){
			return false;
		}
	}

	// refresh the row of the master rank, because its counters kept changing
	// after they were reported
	for(int i=0;i<m_size;i++){
		m_allReceivedMessages[MASTER_RANK*m_size+i]=m_receivedMessages[i];
	}
	return true;
}

void MessagesHandler::writeStats(const char*file){
	FILE*f=fopen(file,"w+");
	for(int i=0;i<m_size;i++){
		fprintf(f,"\t%i",i);
	}
	fprintf(f,"\n");
	for(int i=0;i<m_size;i++){
		fprintf(f,"%i",i);
		for(int j=0;j<m_size;j++){
			fprintf(f,"\t%lu",m_allReceivedMessages[i*m_size+j]);
		}
		fprintf(f,"\n");
	}
	fclose(f);
}

void MessagesHandler::constructor(int rank,int size){
	m_rank=rank;
	m_size=size;
	m_receivedMessages=(u64*)__Malloc(sizeof(u64)*m_size);
	if(rank==MASTER_RANK){
		m_allReceivedMessages=(u64*)__Malloc(sizeof(u64)*m_size*m_size);
		m_allCounts=(int*)__Malloc(sizeof(int)*m_size);
	}
	for(int i=0;i<m_size;i++){
		m_receivedMessages[i]=0;
		if(rank==MASTER_RANK){
			m_allCounts[i]=0;
		}
	}

	// the ring contains 128 elements
	m_ringSize=128;
	m_ring=(MPI_Request*)__Malloc(sizeof(MPI_Request)*m_ringSize);
	m_buffers=(char*)__Malloc(MPI_BTL_SM_EAGER_LIMIT*m_ringSize);
	m_head=0;

	// post a few persistent receives
	for(int i=0;i<m_ringSize;i++){
		void*buffer=m_buffers+i*MPI_BTL_SM_EAGER_LIMIT;
		MPI_Recv_init(buffer,MPI_BTL_SM_EAGER_LIMIT/sizeof(VERTEX_TYPE),MPI_UNSIGNED_LONG_LONG,
			MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,m_ring+i);
		MPI_Start(m_ring+i);
	}
}

u64*MessagesHandler::getReceivedMessages(){
	return m_receivedMessages;
}

void MessagesHandler::freeLeftovers(){
	for(int i=0;i<m_ringSize;i++){
		MPI_Cancel(m_ring+i);
		MPI_Request_free(m_ring+i);
	}
	__Free(m_ring);
	__Free(m_buffers);
}
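receiveMessages() above relies on a detail that is easy to miss: a persistent receive posted with a generous count can match a shorter message, and MPI_Get_count() then reports the actual length, not the posted one. The tiny standalone program below (not part of Ray; the counts and the tag are arbitrary) checks exactly that. Run it with at least 2 ranks.

// getcount.cpp -- check that a short message matches a persistent receive
// posted with a larger count, and that MPI_Get_count() returns the real length.
#include <mpi.h>
#include <cstdio>

int main(int argc,char**argv){
	MPI_Init(&argc,&argv);
	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);

	const int POSTED=512; // upper bound given to MPI_Recv_init
	const int SENT=3;     // actual message length
	unsigned long long buffer[POSTED];

	if(rank==0){
		MPI_Request request;
		MPI_Recv_init(buffer,POSTED,MPI_UNSIGNED_LONG_LONG,
			MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&request);
		MPI_Start(&request);

		MPI_Status status;
		MPI_Wait(&request,&status); // MPI_Test in a loop works too
		int length;
		MPI_Get_count(&status,MPI_UNSIGNED_LONG_LONG,&length);
		printf("posted %d, received %d element(s) from rank %d\n",
			POSTED,length,status.MPI_SOURCE);

		MPI_Request_free(&request);
	}else if(rank==1){
		unsigned long long payload[SENT]={1,2,3};
		MPI_Send(payload,SENT,MPI_UNSIGNED_LONG_LONG,0,0,MPI_COMM_WORLD);
	}

	MPI_Finalize();
	return 0;
}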
/*
 	Ray
	Copyright (C) 2010  Sébastien Boisvert

	http://DeNovoAssembler.SourceForge.Net/

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU General Public License as published by
	the Free Software Foundation, version 3 of the License.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU General Public License for more details.

	You have received a copy of the GNU General Public License
	along with this program (COPYING).
	see <http://www.gnu.org/licenses/>
*/

#ifndef _MessagesHandler
#define _MessagesHandler

#include<vector>
#include<MyAllocator.h>
#include<Message.h>
#include<common_functions.h>
#include<RingAllocator.h>
#include<StaticVector.h>
#include<PendingRequest.h>
using namespace std;

class MessagesHandler{
	int m_ringSize;
	int m_head;
	MPI_Request*m_ring;
	char*m_buffers;

	u64*m_receivedMessages;
	int m_rank;
	int m_size;
	u64*m_allReceivedMessages;
	int*m_allCounts;
public:
	void constructor(int rank,int size);
	void showStats();
	void sendMessages(StaticVector*outbox,int source);
	void receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination);
	u64*getReceivedMessages();
	void addCount(int rank,u64 count);
	void writeStats(const char*file);
	bool isFinished();
	bool isFinished(int rank);
	void freeLeftovers();
};

#endif
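Finally, for anyone who lands on this message without the earlier thread: the hang that started all of this boils down to the classic "unsafe" pattern below. This is only a minimal sketch, not the actual 103-line test program from the thread. With 1-byte payloads the shared-memory BTL typically buffers the sends eagerly and the program completes; at 4096 bytes (the default btl_sm_eager_limit) each MPI_Send waits for a matching receive that nobody has posted yet, and with the sm BTL the ranks deadlock.

// unsafe.cpp -- every rank does a blocking send to its right neighbour
// before posting any receive. Whether this completes depends entirely on
// the MPI implementation buffering the message (eager protocol).
#include <mpi.h>
#include <vector>
#include <cstdio>
#include <cstdlib>

int main(int argc,char**argv){
	MPI_Init(&argc,&argv);
	int rank;
	int size;
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);
	MPI_Comm_size(MPI_COMM_WORLD,&size);

	// message size in bytes, taken from the command line (try 1, then 4096)
	int bytes=(argc>1)?atoi(argv[1]):1;
	if(bytes<1){
		bytes=1;
	}
	std::vector<char> sendBuffer(bytes,'x');
	std::vector<char> recvBuffer(bytes);

	int right=(rank+1)%size;
	int left=(rank-1+size)%size;

	// everybody sends first...
	MPI_Send(&sendBuffer[0],bytes,MPI_BYTE,right,0,MPI_COMM_WORLD);
	// ...and only then receives. A correct program would pre-post the
	// receives, or use MPI_Sendrecv or nonblocking calls here.
	MPI_Recv(&recvBuffer[0],bytes,MPI_BYTE,left,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);

	if(rank==0){
		printf("completed with %d-byte messages\n",bytes);
	}
	MPI_Finalize();
	return 0;
}

Pre-posting the receives, as the persistent ring in MessagesHandler does, removes that dependency on eager buffering: even a rendezvous-size message finds its receive already waiting.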