I, too, have tried various builds of the rc4 release. It's dying during orterun.

Specifically, here's the call chain where things fall apart:

orterun -> orte_init -> opal_init -> opal_carto_base_select -> mca_base_select

54     for (item  = opal_list_get_first(components_available);
55          item != opal_list_get_end(components_available);
56          item  = opal_list_get_next(item) ) {
57         cli = (mca_base_component_list_item_t *) item;
58         component = (mca_base_component_t *) cli->cli_component;

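The loop exits at line #55 without ever running its body, i.e. item must already equal the end of the list on the first pass, which suggests components_available is empty. Control then jumps to line #107, where *best_component is still NULL, so the function returns OPAL_ERR_NOT_FOUND: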

107     if (NULL == *best_component) {
108         opal_output_verbose(5, output_id,
109                             "mca:base:select:(%5s) No component selected!",
110                             type_name);
111         /*
112          * Still close the non-selected components
113          */
114         mca_base_components_close(0, /* Pass 0 to keep this from closing the output handle */
115                                   components_available,
116                                   NULL);
117         return OPAL_ERR_NOT_FOUND;
118     }
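
To make the failure mode concrete, here is a minimal standalone sketch of the same pattern. It is not the Open MPI source: every sketch_* name is hypothetical, the list is a simplified sentinel-based stand-in for opal_list_t, and the -13 is only chosen to mirror the "Returned value -13" seen in the log. It shows that when the available list is empty, the first item already equals the end marker, the loop body never runs, and the selection reports "not found".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in return codes; -13 mirrors the value reported in the log. */
#define SKETCH_SUCCESS        0
#define SKETCH_ERR_NOT_FOUND  (-13)

/* Hypothetical list item; not the real opal_list_item_t. */
typedef struct sketch_item {
    struct sketch_item *next;
    int priority;
} sketch_item_t;

/* Circular list with a sentinel node that doubles as the "end" marker. */
typedef struct {
    sketch_item_t sentinel;
} sketch_list_t;

void sketch_list_init(sketch_list_t *list)
{
    list->sentinel.next = &list->sentinel;  /* empty: first == end */
    list->sentinel.priority = 0;
}

void sketch_list_prepend(sketch_list_t *list, sketch_item_t *item)
{
    item->next = list->sentinel.next;
    list->sentinel.next = item;
}

sketch_item_t *sketch_list_get_first(sketch_list_t *list) { return list->sentinel.next; }
sketch_item_t *sketch_list_get_end(sketch_list_t *list)   { return &list->sentinel; }
sketch_item_t *sketch_list_get_next(sketch_item_t *item)  { return item->next; }

/* Pick the highest-priority item, or report "not found" if there is none. */
int sketch_select(sketch_list_t *available, sketch_item_t **best)
{
    sketch_item_t *item;

    assert(available != NULL && best != NULL);
    *best = NULL;

    /* Same loop shape as lines 54-56 above: with an empty list the
     * condition is false immediately and the body is skipped. */
    for (item  = sketch_list_get_first(available);
         item != sketch_list_get_end(available);
         item  = sketch_list_get_next(item)) {
        if (NULL == *best || item->priority > (*best)->priority) {
            *best = item;
        }
    }

    /* Same check as line 107 above: nothing was ever selected. */
    if (NULL == *best) {
        return SKETCH_ERR_NOT_FOUND;
    }
    return SKETCH_SUCCESS;
}
```

If this matches what is happening in mca_base_select, the thing to chase is why no carto components end up on the list at all, rather than anything in the selection loop itself.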


-david
--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory



Sam Gutierrez wrote:

>   Hi All,

>  I just built OMPI 1.3.4rc4 on one of our Roadrunner machines. When I
>  try to launch a simple MPI job, I get the following:

>  [rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for
>  carto components
> [rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto
>  components
>  [rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto
>  components
>  [rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component
>  selected!
> --------------------------------------------------------------------------
>  It looks like opal_init failed for some reason; your parallel process is
>  likely to abort. There are many reasons that a parallel process can
>  fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
>  here's some additional information (which may only be relevant to an
>  Open MPI developer):

>     opal_carto_base_select failed
>     --> Returned value -13 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
>  [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>  found in file runtime/orte_init.c at line 77
>  [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>  found in file orterun.c at line 541

>  This may be an issue on our end regarding a runtime parameter that
>  isn't set correctly. See attached. Please let me know if you need
>  any more info.

>  Thanks!

>  --
Samuel K. Gutierrez
Los Alamos National Laboratory



On Nov 4, 2009, at 3:00 PM, Jeff Squyres wrote:
> The latest-n-greatest is available here:
>
> http://www.open-mpi.org/software/ompi/v1.3/
>
> Please beat it up and look for problems!
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
