Dear openmpi developers,
I have very recently joined this list to be able to submit this report, as
recommended in the FAQ, why I have not had time to familiarize myself with
the etiquette of this list, for which I apologize in advance. If you
require any additional information I am more than happy to provide you
with such information.
One of the users of our computer cluster have reported a problem with his
simulation code. We have been able to trace this down to a problem with
the MPI struct when absolute addressing is used for the members in the
struct, necessating the use of MPI_BOTTOM. The problem occurs with
MPI_BCAST, but works fine with MPI_SEND and MPI_RECV. I digged down into
the openmpi code and found out that the problem occurs only in fortran,
which seems to be because when MPI_BCAST is called, the OMPI_ADDR macro in
ompi/mpi/f77/constants.h is never evaluated which thus never turns the
fortran MPI_BOTTOM into a C MPI_BOTTOM. When MPI_SEND and MPI_RECV are
used the OMPI_ADDR macro is evaluated and no problem occurs.
An example of a problematic code (tested on two processes):
program testme
implicit none
include 'mpif.h'
integer ierr,rank,size
integer btype(1),blen(1),param_batch
real param
integer(kind=mpi_address_kind) :: disp(1)
integer status(MPI_STATUS_SIZE)
param=0.
call MPI_INIT(ierr)
call MPI_GET_ADDRESS(param,disp(1),ierr)
if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_GET_ADDRESS FAILED'
btype(1)=MPI_REAL
blen(1)=1
call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)
call MPI_TYPE_CREATE_STRUCT(1,blen,disp,btype,param_batch,ierr)
if (ierr.ne.MPI_SUCCESS) then
write(*,*) 'MPI_TYPE_CREATE_STRUCT FAILED'
endif
call MPI_TYPE_COMMIT(param_batch,ierr)
if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_TYPE_COMMIT FAILED'
if (rank.eq.0) param=100.
c if (rank.eq.0) then
c call MPI_SEND(MPI_BOTTOM,1,
c x param_batch,1,0,MPI_COMM_WORLD,ierr)
c else
c call MPI_RECV(MPI_BOTTOM,1,
c x param_batch,0,0,MPI_COMM_WORLD,status,ierr)
c endif
call MPI_BCAST(MPI_BOTTOM,1,param_batch,0,MPI_COMM_WORLD,ierr)
if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_BCAST FAILED'
write(*,*) 'Rank:',rank,'Size:',size,'Param=',param
call MPI_Finalize(ierr)
end program testme
mpirun -np 2 testme_submitted_to_devel
[auchentoshan:21021] *** Process received signal ***
[auchentoshan:21021] Signal: Segmentation fault (11)
[auchentoshan:21021] Signal code: Address not mapped (1)
[auchentoshan:21021] Failing at address: 0x7fc04ffd7c
[auchentoshan:21021] [ 0] /lib64/tls/libpthread.so.0 [0x3d9660c430]
[auchentoshan:21021] [ 1] /lib64/tls/libc.so.6(memcpy+0x60) [0x3d95571ec0]
[auchentoshan:21021] [ 2]
/opt/openmpi-1.2.1-gcc4/lib/libmpi.so.0(ompi_convertor_pack+0x164)
[0x2a957cf5b4]
[auchentoshan:21021] [ 3]
/opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x24c)
[0x2a982851dc]
[auchentoshan:21021] [ 4]
/opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x217)
[0x2a9827fe77]
[auchentoshan:21021] [ 5]
/opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_generic+0x354)
[0x2a98ac08b4]
[auchentoshan:21021] [ 6]
/opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_binomial+0xc8)
[0x2a98ac0bd8]
[auchentoshan:21021] [ 7]
/opt/openmpi-1.2.1-gcc4/lib/libmpi.so.0(PMPI_Bcast+0x15c) [0x2a957d62ac]
[auchentoshan:21021] [ 8]
/opt/openmpi-1.2.1-gcc4/lib/libmpi_f77.so.0(pmpi_bcast_+0x5a)
[0x2a9567e99a]
[auchentoshan:21021] [ 9] testme_submitted_to_devel(MAIN__+0x1f8)
[0x401080]
[auchentoshan:21021] [10] testme_submitted_to_devel(main+0xe) [0x4011be]
[auchentoshan:21021] [11] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
[0x3d9551c3fb]
[auchentoshan:21021] [12] testme_submitted_to_devel [0x400dfa]
[auchentoshan:21021] *** End of error message ***
mpirun noticed that job rank 0 with PID 21021 on node auchentoshan exited
on signal 11 (Segmentation fault).
1 additional process aborted (not shown)
The openmpi version we have tested this on includes the latest, 1.2.2
version as well although the log message above is for version 1.2.1.
Distribution: Scientific Linux 4.4 (RHEL4 clone). The problem occurs on
both AMD64 and i386. The problem occurs both when using gcc/gfortran and
the portland group compilers version 7.0-4. Interconnect GBE.
daniels@auchentoshan:~ > gcc4 --version
gcc4 (GCC) 4.1.0 20060515 (Red Hat 4.1.0-18)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
daniels@auchentoshan:~ > gfortran --version
GNU Fortran 95 (GCC) 4.1.0 20060515 (Red Hat 4.1.0-18)
Copyright (C) 2006 Free Software Foundation, Inc.
GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING
Working C code which does the same thing:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv)
{
int rank,size;
float param=0.;
MPI_Datatype btype[1],param_batch;
MPI_Aint disp[1];
int blen[1];
MPI_Init(&argc,&argv);
if (MPI_Get_address(¶m,&disp[0])!=MPI_SUCCESS)
fprintf(stderr,"MPI_Get_address failed.\n");
btype[0]=MPI_FLOAT;
blen[0]=1;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
if (MPI_Type_create_struct(1,blen,disp,btype,¶m_batch)!=MPI_SUCCESS)
fprintf(stderr,"MPI_Type_Create_Struct failed.\n");
if (MPI_Type_commit(¶m_batch)!=MPI_SUCCESS)
fprintf(stderr,"MPI_Type_Commit failed.\n");
if (rank==0)
param=100.;
if (MPI_Bcast(MPI_BOTTOM,1,param_batch,0,MPI_COMM_WORLD)!=MPI_SUCCESS)
fprintf(stderr,"MPI_Bcast failed.\n");
printf("Rank:%d, Size:%d, Param=%f\n",rank,size,param);
MPI_Finalize();
return 0;
}
Best regards
Daniel Spångberg