Sounds strange - the locality is definitely being set in the code. Can you run it with -mca hwloc_base_verbose 5 --display-map? That should tell us where it thinks things are running, and what locality it is recording.
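For example, something like this (your original command line with the two extra options added) should do it:

  /home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec -mca hwloc_base_verbose 5 --display-map --mca coll sm,self --mca coll_sm_priority 99 -n 2 ./bcast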
On Jul 3, 2012, at 11:54 AM, Juan Antonio Rico Gallego wrote:

> Hello everyone. Maybe you can help me:
> 
> I got a subversion checkout (r26725) of the developers' trunk. I configured with:
> 
>   ../../onecopy/ompi-trunk/configure --prefix=/home/jarico/shared/packages/openmpi-cas-dbg --disable-shared --enable-static --enable-debug --enable-mem-profile --enable-mem-debug CFLAGS=-g
> 
> Compiling is OK, but when I try to run on a shared-memory machine with the SM component:
> 
>   /home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec --mca mca_base_verbose 100 --mca mca_coll_base_output 100 --mca coll sm,self --mca coll_sm_priority 99 -n 2 ./bcast
> 
> I get the error message:
> 
> --------------------------------------------------------------------------
> Although some coll components are available on your system, none of
> them said that they could be used for a new communicator.
> 
> This is extremely unusual -- either the "basic" or "self" components
> should be able to be chosen for any communicator. As such, this
> likely means that something else is wrong (although you should double
> check that the "basic" and "self" coll components are available on
> your system -- check the output of the "ompi_info" command).
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   mca_coll_base_comm_select(MPI_COMM_WORLD) failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [Metropolis-01:15120] *** An error occurred in MPI_Init
> [Metropolis-01:15120] *** reported by process [3914661889,0]
> [Metropolis-01:15120] *** on a NULL communicator
> [Metropolis-01:15120] *** Unknown error
> [Metropolis-01:15120] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [Metropolis-01:15120] ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
> 
>   Reason:     Before MPI_INIT completed
>   Local host: Metropolis-01
>   PID:        15120
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 0 with PID 15120 on
> node Metropolis-01 exiting improperly. There are three reasons this could
> occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> 
> You can avoid this message by specifying -quiet on the mpiexec command line.
> --------------------------------------------------------------------------
> [Metropolis-01:15119] 1 more process has sent help message help-mca-coll-base / comm-select:none-available
> [Metropolis-01:15119] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [Metropolis-01:15119] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
> [Metropolis-01:15119] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> [Metropolis-01:15119] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
> [jarico@Metropolis-01 examples]$
> 
> It seems to be a problem choosing the SM component because of the locality of the
> processes: the mca_coll_sm_init_query function returns OMPI_ERR_NOT_AVAILABLE.
> I remember that in previous releases (around r26206) I needed to change the
> ompi_proc_init() function a little, adding the lines:
> 
>     } else {
>         /* get the locality information */
>         proc->proc_flags = orte_ess.proc_get_locality(&proc->proc_name);
>         /* get the name of the node it is on */
>         proc->proc_hostname = orte_ess.proc_get_hostname(&proc->proc_name);
>     }
> 
> which was enough for it to run OK. But this function has changed and that code no
> longer works. I am not sure now what I am doing wrong.
> 
> Thanks for your time,
> Juan A. Rico
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
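For a bit of context on why missing locality knocks out the sm component: sm can only be used when every peer in the communicator is on the same node, so if the locality bits in proc_flags never get filled in (which is what orte_ess.proc_get_locality did in your old workaround), mca_coll_sm_init_query has nothing to go on and reports OMPI_ERR_NOT_AVAILABLE. Here is a minimal standalone sketch of that selection logic - the struct, flag value, and function names below are made up purely for illustration; only proc_flags and OMPI_ERR_NOT_AVAILABLE correspond to real identifiers from your mail:

#include <stdio.h>

/* Illustrative stand-ins only -- not the real Open MPI types or values. */
#define FAKE_LOCAL_NODE 0x01            /* "peer is on my node" bit */

struct fake_proc {
    unsigned int proc_flags;            /* locality bits; 0 if never set */
};

/* sm-style test: usable only if every peer is flagged as local.
 * In the real component a failure here surfaces as OMPI_ERR_NOT_AVAILABLE. */
static int all_peers_local(const struct fake_proc *procs, int nprocs)
{
    for (int i = 0; i < nprocs; ++i) {
        if (!(procs[i].proc_flags & FAKE_LOCAL_NODE)) {
            return 0;
        }
    }
    return 1;
}

int main(void)
{
    struct fake_proc procs[2] = { { 0 }, { 0 } };  /* locality never recorded */
    printf("sm usable: %s\n", all_peers_local(procs, 2) ? "yes" : "no");

    procs[0].proc_flags = FAKE_LOCAL_NODE;         /* what filling in proc_flags   */
    procs[1].proc_flags = FAKE_LOCAL_NODE;         /* in ompi_proc_init gave you   */
    printf("sm usable: %s\n", all_peers_local(procs, 2) ? "yes" : "no");
    return 0;
}

So the --display-map / hwloc verbose output should show whether the two ranks are actually being recorded as local to each other.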