Oops - yep, that is an oversight! Will fix - thanks!

On Feb 9, 2010, at 7:13 AM, Guillaume Thouvenin wrote:

> Hello,
> 
> It seems that a return value is not updated during the setup of
> process affinity in ompi_mpi_init() (ompi/runtime/ompi_mpi_init.c:459).
> 
> The problem is in the following piece of code:
> 
>    [... here ret == OPAL_SUCCESS ...]
>    phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
>    if (0 > phys_cpu) {
>        error = "Could not get physical processor id - cannot set processor affinity";
>        goto error;
>    }
>    [...]
> 
> If opal_paffinity_base_get_physical_processor_id() fails, ret is not
> updated, so we reach the "error:" label with ret still equal to
> OPAL_SUCCESS.
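> 
> For illustration, here is a standalone sketch of the failure mode and
> the fix (get_id(), init_like(), and MY_SUCCESS are made-up names for
> this example; the real code uses
> opal_paffinity_base_get_physical_processor_id() and the OPAL return
> codes):
> 
>    #include <stdio.h>
> 
>    #define MY_SUCCESS 0
> 
>    /* stand-in for the paffinity call; always fails with -5 ("Not found") */
>    static int get_id(void) { return -5; }
> 
>    static int init_like(void)
>    {
>        int ret = MY_SUCCESS;
>        const char *error = NULL;
>        int id;
> 
>        id = get_id();
>        if (0 > id) {
>            ret = id;  /* the one-line fix: propagate the failure */
>            error = "Could not get physical processor id";
>            goto error;
>        }
>        return MY_SUCCESS;
> 
>    error:
>        fprintf(stderr, "%s (ret=%d)\n", error, ret);
>        return ret;    /* without "ret = id;" this returns MY_SUCCESS */
>    }
> 
>    int main(void)
>    {
>        return init_like() == MY_SUCCESS ? 0 : 1;
>    }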
> 
> As a result, MPI_Init() returns without having initialized the
> MPI_COMM_WORLD struct, leading to a segmentation fault on subsequent
> calls like MPI_Comm_size().
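> 
> Note that an application cannot easily guard against this: since ret
> stays OPAL_SUCCESS, MPI_Init() itself appears to succeed. Any minimal
> program that queries MPI_COMM_WORLD reproduces the crash, e.g. (sketch):
> 
>    #include <mpi.h>
> 
>    int main(int argc, char **argv)
>    {
>        int size;
> 
>        /* Appears to succeed despite the internal failure... */
>        MPI_Init(&argc, &argv);
> 
>        /* ...but MPI_COMM_WORLD was never set up: segfault here. */
>        MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>        MPI_Finalize();
>        return 0;
>    }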
> 
> I hit the bug recently on new Westmere processors, where
> opal_paffinity_base_get_physical_processor_id() fails when the MCA
> parameter "opal_paffinity_alone 1" is used during execution.
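> 
> For reference, such a run would be launched with something like (exact
> arguments hypothetical):
> 
>    mpirun -mca opal_paffinity_alone 1 -np 2 ./IMB-MPI1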
> 
> I'm not sure this is the right way to fix the problem, but here is a
> patch, tested against v1.5, that makes Open MPI report the error
> instead of segfaulting.
> 
> With the patch, the output is:
> 
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  Could not get physical processor id - cannot set processor affinity
>  --> Returned "Not found" (-5) instead of "Success" (0)
> --------------------------------------------------------------------------
> 
> Without the patch, the output was:
> 
> *** Process received signal ***
> Signal: Segmentation fault (11)
> Signal code: Address not mapped (1)
> Failing at address: 0x10
> [ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
> [ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
> [ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
> [ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
> [ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
> [ 5] ./IMB-MPI1 [0x403499]
> 
> 
> Regards,
> Guillaume
> 
> ---
> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
> --- a/ompi/runtime/ompi_mpi_init.c
> +++ b/ompi/runtime/ompi_mpi_init.c
> @@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
>                 OPAL_PAFFINITY_CPU_ZERO(mask);
>                 phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
>                 if (0 > phys_cpu) {
> +                    ret = phys_cpu;
>                     error = "Could not get physical processor id - cannot set processor affinity";
>                     goto error;
>                 }

