To run the code I usually do "mpirun -np 6 a.out 10" on a 2-core system. It prints the following and then hangs:
Target duration (seconds):         10.000000
# of messages sent in that time:      589207
Microseconds per message:             16.972


Terry D. Dontje wrote:

Heard you the first time Gleb, just been backed up with other stuff. Following is the code:

 include "mpif.h"

 character(20) cmd_line_arg     ! We'll use the first command-line argument
                                ! to set the duration of the test.

 real(8) :: duration = 10       ! The default duration (in seconds) can be
                                ! set here.

 real(8) :: endtime             ! This is the time at which we'll end the
                                ! test.

 integer(8) :: nmsgs = 1        ! We'll count the number of messages sent
                                ! out from each MPI process. There will be
                                ! at least one message (at the very end),
                                ! and we'll count all the others.

 logical :: keep_going = .true. ! This flag says whether to keep going.

 ! Initialize MPI stuff.

 call MPI_Init(ier)
 call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
 call MPI_Comm_size(MPI_COMM_WORLD, np, ier)

 if ( np == 1 ) then

   ! Test to make sure there is at least one other process.

   write(6,*) "Need at least 2 processes."
   write(6,*) "Try resubmitting the job with"
   write(6,*) "   'mpirun -np <np>'"
   write(6,*) "where <np> is at least 2."

 else if ( me == 0 ) then

   ! The first command-line argument is the duration of the test (seconds).

   call get_command_argument(1,cmd_line_arg,len,istat)
   if ( istat == 0 ) read(cmd_line_arg,*) duration

   ! Loop until test is done.

   endtime = MPI_Wtime() + duration     ! figure out when to end
   do while ( MPI_Wtime() < endtime )
     call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
     nmsgs = nmsgs + 1
   end do

   ! Then, send the closing signal.

   keep_going = .false.
   call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)

   ! Write summary information.

   write(6,'("Target duration (seconds):",f18.6)') duration
   write(6,'("# of messages sent in that time:", i12)') nmsgs
   write(6,'("Microseconds per message:", f19.3)') 1.d6 * duration / nmsgs

 else

   ! If you're not Process 0, you need to receive messages
   ! (and possibly relay them onward).

   do while ( keep_going )

     call MPI_Recv(keep_going,1,MPI_LOGICAL,me-1,1,MPI_COMM_WORLD, &
                   MPI_STATUS_IGNORE,ier)

     if ( me == np - 1 ) cycle ! The last process only receives messages.

     call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,MPI_COMM_WORLD,ier)

   end do

 end if

 ! Finalize.

 call MPI_Finalize(ier)

 end


Sorry it is in Fortran.

Gleb Natapov wrote:

On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote:
If you are going to look at it, I will not bother with this.

I need the code to reproduce the problem. Otherwise I have nothing to
look at.

On 8/29/07 10:47 AM, "Gleb Natapov" <> wrote:

On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:
Are you looking at this?

Not today. And I need the code to reproduce the bug. Is this possible?


On 8/29/07 9:56 AM, "Gleb Natapov" <> wrote:

On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
Is this trunk or 1.2?

Oops. I should read more carefully :) This is trunk.

On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
I have a program that does a simple bucket brigade of sends and receives, where rank 0 is the start and repeatedly sends to rank 1 until a certain amount of time has passed, and then it sends an all-done packet.

Running this with np=2 always works. However, when I run with np greater than 2 using only the SM BTL, the program usually hangs, and one of the processes has a long stack that contains many repetitions of the following 3 calls:

[25] opal_progress(), line 187 in "opal_progress.c"
[26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
[27] mca_bml_r2_progress(), line 110 in "bml_r2.c"

When stepping through the ompi_fifo_write_to_head routine it looks like
the fifo has overflowed.

I am wondering if what is happening is that rank 0 has sent a bunch of
messages that have exhausted the resources, such that one of the middle
ranks, which is in the process of sending, cannot send and therefore
never gets to the point of trying to receive the messages from rank 0?

Is the above a possible scenario or are messages periodically bled off
the SM BTL's fifos?
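
One way to probe the exhaustion scenario above is to keep any rank from running ahead of its neighbor. The following is only a sketch of that idea, not something tried in this thread: it is the same bucket brigade with every MPI_Send swapped for MPI_Ssend (a standard MPI call), so a send completes only after the matching receive has started and each link carries at most one unmatched message. The program name brigade_ssend is made up for illustration, and the duration is fixed rather than read from the command line.

```fortran
! Sketch only: bucket brigade with synchronous sends, used as a diagnostic.
! Assumption: throttling with MPI_Ssend is a test of the hypothesis, not a fix.
program brigade_ssend
  implicit none
  include "mpif.h"

  integer    :: ier, me, np
  integer(8) :: nmsgs = 1             ! counts the closing message too
  real(8)    :: duration = 10         ! test length in seconds (fixed here)
  real(8)    :: endtime
  logical    :: keep_going = .true.

  call MPI_Init(ier)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
  call MPI_Comm_size(MPI_COMM_WORLD, np, ier)

  if ( np < 2 ) then
    write(6,*) "Need at least 2 processes."
  else if ( me == 0 ) then
    endtime = MPI_Wtime() + duration
    do while ( MPI_Wtime() < endtime )
      ! Completes only once rank 1 has started the matching receive.
      call MPI_Ssend(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
      nmsgs = nmsgs + 1
    end do
    keep_going = .false.
    call MPI_Ssend(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
    write(6,'("# of messages sent in that time:", i12)') nmsgs
  else
    do while ( keep_going )
      call MPI_Recv(keep_going,1,MPI_LOGICAL,me-1,1,MPI_COMM_WORLD, &
                    MPI_STATUS_IGNORE,ier)
      if ( me == np - 1 ) cycle       ! the last rank only receives
      call MPI_Ssend(keep_going,1,MPI_LOGICAL,me+1,1,MPI_COMM_WORLD,ier)
    end do
  end if

  call MPI_Finalize(ier)
end program brigade_ssend
```

If this variant runs to completion at np > 2 where the original hangs, that would be consistent with the original flooding the receivers, though it would not by itself show where in the SM BTL the resources run out.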

Note, I have seen np=3 pass sometimes and I can get it to pass reliably if I raise the shared memory space used by the BTL. This is using the


