Sounds strange - the locality is definitely being set in the code. Can you run 
it with -mca hwloc_base_verbose 5 --display-map? That should tell us where it 
thinks things are running and what locality it is recording.
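
For example, something along these lines (reusing your original command line, 
so adjust the path and arguments as needed):

  /home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec -mca hwloc_base_verbose 5 \
      --display-map --mca coll sm,self --mca coll_sm_priority 99 -n 2 ./bcast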


On Jul 3, 2012, at 11:54 AM, Juan Antonio Rico Gallego wrote:

> Hello everyone. Maybe you can help me:
> 
> I checked out a Subversion snapshot (r26725) of the developers' trunk. I configured with:
> 
> ../../onecopy/ompi-trunk/configure 
> --prefix=/home/jarico/shared/packages/openmpi-cas-dbg --disable-shared 
> --enable-static --enable-debug --enable-mem-profile --enable-mem-debug 
> CFLAGS=-g
> 
> Compilation is OK, but when I try to run on a shared-memory machine with the sm 
> component:
> 
> /home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec --mca 
> mca_base_verbose 100 --mca mca_coll_base_output 100 --mca coll sm,self --mca 
> coll_sm_priority 99  -n 2 ./bcast
> 
> I get the error message:
> 
> 
> --------------------------------------------------------------------------
> Although some coll components are available on your system, none of
> them said that they could be used for a new communicator.
> 
> This is extremely unusual -- either the "basic" or "self" components
> should be able to be chosen for any communicator.  As such, this
> likely means that something else is wrong (although you should double
> check that the "basic" and "self" coll components are available on
> your system -- check the output of the "ompi_info" command).
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   mca_coll_base_comm_select(MPI_COMM_WORLD) failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [Metropolis-01:15120] *** An error occurred in MPI_Init
> [Metropolis-01:15120] *** reported by process [3914661889,0]
> [Metropolis-01:15120] *** on a NULL communicator
> [Metropolis-01:15120] *** Unknown error
> [Metropolis-01:15120] *** MPI_ERRORS_ARE_FATAL (processes in this 
> communicator will now abort,
> [Metropolis-01:15120] ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
> 
>   Reason:     Before MPI_INIT completed
>   Local host: Metropolis-01
>   PID:        15120
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 0 with PID 15120 on
> node Metropolis-01 exiting improperly. There are three reasons this could 
> occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> 
> You can avoid this message by specifying -quiet on the mpiexec command line.
> 
> --------------------------------------------------------------------------
> [Metropolis-01:15119] 1 more process has sent help message help-mca-coll-base 
> / comm-select:none-available
> [Metropolis-01:15119] Set MCA parameter "orte_base_help_aggregate" to 0 to 
> see all help / error messages
> [Metropolis-01:15119] 1 more process has sent help message help-mpi-runtime / 
> mpi_init:startup:internal-failure
> [Metropolis-01:15119] 1 more process has sent help message 
> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> [Metropolis-01:15119] 1 more process has sent help message 
> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
> [jarico@Metropolis-01 examples]$ 
> 
> 
> 
> It seems to be a problem choosing the sm component because of the locality of the 
> processes: the mca_coll_sm_init_query function returns 
> OMPI_ERR_NOT_AVAILABLE.
> I remember that in earlier revisions (around r26206) I needed to change the 
> ompi_proc_init() function slightly, adding these lines:
> 
>         } else {
>             /* get the locality information */
>             proc->proc_flags = orte_ess.proc_get_locality(&proc->proc_name);
>             /* get the name of the node it is on */
>             proc->proc_hostname = orte_ess.proc_get_hostname(&proc->proc_name);
>         }
> 
> 
> and that was enough to make it run correctly. But this function has changed and this 
> code no longer works. I am not sure now what I am doing wrong.
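> 
> Just to illustrate the kind of check I mean, here is a tiny standalone sketch (not the 
> actual Open MPI code; every name in it is made up) of a shared-memory component that 
> declines when the peers' locality flags are never filled in:
> 
>     /* illustration only: require every peer to carry a "same node" flag;
>      * if the flags are never set, the component declines */
>     #include <stdbool.h>
>     #include <stdio.h>
> 
>     #define PROC_ON_LOCAL_NODE 0x01   /* made-up flag bit */
> 
>     struct peer { unsigned flags; };
> 
>     static bool all_peers_local(const struct peer *peers, int n) {
>         for (int i = 0; i < n; ++i) {
>             if (!(peers[i].flags & PROC_ON_LOCAL_NODE)) {
>                 return false;   /* one non-local (or unset) peer is enough to decline */
>             }
>         }
>         return true;
>     }
> 
>     int main(void) {
>         /* leaving flags at 0 reproduces the symptom: the component is not usable */
>         struct peer peers[2] = { { 0 }, { 0 } };
>         printf("sm usable: %s\n", all_peers_local(peers, 2) ? "yes" : "no");
>         return 0;
>     }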
> 
> Thanks for your time,
> Juan A. Rico
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
