Oops - yep, that is an oversight! Will fix - thanks!

On Feb 9, 2010, at 7:13 AM, Guillaume Thouvenin wrote:
> Hello,
>
> It seems that a return value is not updated during the setup of
> process affinity in ompi_mpi_init()
> (ompi/runtime/ompi_mpi_init.c:459).
>
> The problem is in the following piece of code:
>
>     [... here ret == OPAL_SUCCESS ...]
>     phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
>     if (0 > phys_cpu) {
>         error = "Could not get physical processor id - cannot set processor affinity";
>         goto error;
>     }
>     [...]
>
> If opal_paffinity_base_get_physical_processor_id() fails, ret is not
> updated, so we reach the "error:" label with ret still == OPAL_SUCCESS.
>
> As a result, MPI_Init() returns without having initialized the
> MPI_COMM_WORLD struct, leading to a segmentation fault on calls like
> MPI_Comm_size().
>
> I hit the bug recently on new Westmere processors, where
> opal_paffinity_base_get_physical_processor_id() fails when the MCA
> parameter "opal_paffinity_alone 1" is set for the run.
>
> I'm not sure this is the right way to fix the problem, but here is a
> patch tested against v1.5. It reports the error instead of producing a
> segmentation fault.
>
> With the patch, the output is:
>
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   Could not get physical processor id - cannot set processor affinity
>   --> Returned "Not found" (-5) instead of "Success" (0)
> --------------------------------------------------------------------------
>
> Without the patch, the output was:
>
> *** Process received signal ***
> Signal: Segmentation fault (11)
> Signal code: Address not mapped (1)
> Failing at address: 0x10
> [ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
> [ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
> [ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
> [ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
> [ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
> [ 5] ./IMB-MPI1 [0x403499]
>
> Regards,
> Guillaume
>
> ---
> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
> --- a/ompi/runtime/ompi_mpi_init.c
> +++ b/ompi/runtime/ompi_mpi_init.c
> @@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
>          OPAL_PAFFINITY_CPU_ZERO(mask);
>          phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
>          if (0 > phys_cpu) {
> +            ret = phys_cpu;
>              error = "Could not get physical processor id - cannot set processor affinity";
>              goto error;
>          }
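
For reference, the failure boils down to a shared error label that trusts ret: any branch that jumps there without first updating ret looks like success. Below is a minimal standalone sketch of that idiom and the one-line fix; the names (do_step(), init_like()) and the simplified error label are illustrative stand-ins, not actual Open MPI code:

    #include <stdio.h>

    #define OPAL_SUCCESS        0
    #define OPAL_ERR_NOT_FOUND -5

    /* Stand-in for opal_paffinity_base_get_physical_processor_id();
       here it always fails, as on the Westmere machines described above. */
    static int do_step(void)
    {
        return OPAL_ERR_NOT_FOUND;
    }

    static int init_like(void)
    {
        int ret = OPAL_SUCCESS;   /* only changes if a branch updates it */
        const char *error = NULL;
        int rc;

        rc = do_step();
        if (0 > rc) {
            ret = rc;             /* the line the patch adds */
            error = "Could not get physical processor id";
            goto error;
        }

        /* ... the rest of the initialization would run here ... */

     error:
        if (OPAL_SUCCESS != ret) {
            fprintf(stderr, "%s --> returned %d instead of %d\n",
                    error, ret, OPAL_SUCCESS);
            return ret;
        }
        /* Without the "ret = rc" above, a failed step falls through to
           here and the caller sees success despite the skipped setup. */
        return OPAL_SUCCESS;
    }

    int main(void)
    {
        return (OPAL_SUCCESS == init_like()) ? 0 : 1;
    }

Note that before the patch, MPI_Init() itself returned success on this path, so even a caller that checked MPI_Init's return code could not have avoided the crash; the assignment has to happen inside ompi_mpi_init() itself.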