Excellent! Thanks Josh - both for the original work/commit and for the quick
fix!

Ralph


On 5/6/08 3:58 PM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

> Sorry about that. Looking back at the filem logic it seems that I
> returned success even if select failed (and just used the 'none'
> passthrough component). I committed a patch in r18389 that fixes this
> problem.
> 
> This commit now has a warning that prints on the filem verbose stream
> so if a user hits something like this in the wild unexpectedly then
> we can help them debug it a bit.
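
A minimal sketch of the fallback behavior described above, for illustration only. It is not the code from r18389: the function name, its parameters, and the warning text are all hypothetical. It assumes that a failed selection is reported as a NULL component name and that the caller passes in the verbosity level and output stream.

```c
#include <stdio.h>

/* Hypothetical helper (not an Open MPI function): return the selected
 * component name, or fall back to the 'none' passthrough component when
 * nothing was selected. A warning is printed only when verbose output is
 * enabled, so normal runs stay quiet. */
static const char *
sketch_filem_pick(const char *selected, int verbose, FILE *stream)
{
    if (selected != NULL) {
        return selected;            /* selection succeeded; nothing to do */
    }
    if (verbose > 0 && stream != NULL) {
        /* Assumed wording; the real warning text is not shown in the thread. */
        fprintf(stream, "filem: no component selected; using 'none'\n");
    }
    return "none";                  /* selection failure is not fatal */
}
```

With this shape, a build that has no selectable filem component still initializes, and the warning gives a starting point for debugging when verbosity is turned on.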
> 
> Cheers,
> Josh
> 
> 
> On May 6, 2008, at 2:56 PM, Ralph H Castain wrote:
> 
>> Hmmm....well, I hit a problem (of course!). I have mca-no-build on the
>> filem framework on my Mac. If I just mpirun -n 3 ./hello, I get the
>> following error:
>> 
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_filem_base_select failed
>>   --> Returned value Error (-1) instead of ORTE_SUCCESS
>>
>> --------------------------------------------------------------------------
>> 
>> After looking at the source code for filem_select, I can run just fine
>> if I specify -mca filem none on the cmd line. Otherwise, it looks like
>> your select logic insists that at least one component must be built and
>> selectable?
>>
>> Is that generally true, or is your filem framework the exception? I
>> think this would not be a good general requirement - frankly, I don't
>> think it is good for any framework to have such a requirement.
>> 
>> Ralph
>> 
>> 
>> 
>> On 5/6/08 12:09 PM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:
>> 
>>> This has been committed in r18381
>>> 
>>> Please let me know if you have any problems with this commit.
>>> 
>>> Cheers,
>>> Josh
>>> 
>>> On May 5, 2008, at 10:41 AM, Josh Hursey wrote:
>>> 
>>>> Awesome.
>>>> 
>>>> The branch is updated to the latest trunk head. I encourage folks to
>>>> check out this repository and make sure that it builds on their
>>>> system. A normal build of the branch should be enough to find out if
>>>> there are any cut-n-paste problems (though I tried to be careful,
>>>> mistakes do happen).
>>>> 
>>>> I haven't heard of any problems, so this is looking like it will come
>>>> in tomorrow after the teleconf. I'll ask again there to see if there
>>>> are any voices of concern.
>>>> 
>>>> Cheers,
>>>> Josh
>>>> 
>>>> On May 5, 2008, at 9:58 AM, Jeff Squyres wrote:
>>>> 
>>>>> This all sounds good to me!
>>>>> 
>>>>> On Apr 29, 2008, at 6:35 PM, Josh Hursey wrote:
>>>>> 
>>>>>> What:  Add mca_base_select() and adjust frameworks & components to
>>>>>>        use it.
>>>>>> Why:   Consolidation of code for general goodness.
>>>>>> Where: https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play
>>>>>> When:  Code ready now. Documentation ready soon.
>>>>>> Timeout: May 6, 2008 (After teleconf) [1 week]
>>>>>> 
>>>>>> Discussion:
>>>>>> -----------
>>>>>> For a number of years a few developers have been talking about
>>>>>> creating an MCA base component selection function. For various
>>>>>> reasons this was never implemented. Recently I decided to give it a
>>>>>> try.
>>>>>> 
>>>>>> A base select function will allow Open MPI to provide completely
>>>>>> consistent selection behavior for many of its frameworks (18 of 31,
>>>>>> to be exact, at the moment). The primary goal of this work is to
>>>>>> improve code maintainability through code reuse. Other benefits also
>>>>>> result, such as a slightly smaller memory footprint.
>>>>>> 
>>>>>> The mca_base_select() function represents the most commonly used
>>>>>> logic for component selection: select the one component with the
>>>>>> highest priority and close all of the unselected components. This
>>>>>> function can be found at the path below in the branch:
>>>>>> opal/mca/base/mca_base_components_select.c
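
The selection rule described above (query every component, keep the one with the highest priority, close the rest) can be sketched roughly as follows. This is an illustration, not the contents of mca_base_components_select.c: the `sketch_*` and `demo_*` names, the simplified component struct, and the 0 / -1 return codes are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in; not the real mca_base_component_t. */
typedef struct sketch_component {
    const char *name;
    /* Returns 0 and fills in *module and *priority when usable here. */
    int (*query)(void **module, int *priority);
    void (*close)(void);
} sketch_component_t;

/* Select the usable component with the highest priority and close every
 * other component. Returns the winner, or NULL if none was selectable. */
static const sketch_component_t *
sketch_select(const sketch_component_t *components, size_t count,
              void **best_module)
{
    const sketch_component_t *best = NULL;
    int best_priority = 0;
    size_t i;

    *best_module = NULL;
    for (i = 0; i < count; ++i) {
        void *module = NULL;
        int priority = 0;

        if (components[i].query == NULL ||
            components[i].query(&module, &priority) != 0 ||
            module == NULL) {
            continue;               /* component declined to be selected */
        }
        if (best == NULL || priority > best_priority) {
            best = &components[i];
            best_priority = priority;
            *best_module = module;
        }
    }
    for (i = 0; i < count; ++i) {   /* close everything that lost */
        if (&components[i] != best && components[i].close != NULL) {
            components[i].close();
        }
    }
    return best;
}

/* Three made-up components used to exercise the sketch. */
static int demo_closed_count = 0;
static int demo_low_module, demo_high_module;

static void demo_close(void) { demo_closed_count++; }

static int demo_low_query(void **module, int *priority)
{
    *module = &demo_low_module;
    *priority = 10;
    return 0;
}

static int demo_high_query(void **module, int *priority)
{
    *module = &demo_high_module;
    *priority = 50;
    return 0;
}

static int demo_decline_query(void **module, int *priority)
{
    *module = NULL;
    *priority = -1;
    return -1;
}

static const sketch_component_t demo_components[] = {
    { "low",      demo_low_query,     demo_close },
    { "declines", demo_decline_query, demo_close },
    { "high",     demo_high_query,    demo_close },
};
```

Calling sketch_select(demo_components, 3, &module) picks the "high" entry and calls close on the other two. Note that in this sketch a NULL result simply means nothing was selectable; whether that should be treated as an error is left to the calling framework.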
>>>>>> 
>>>>>> To support this I had to formalize a query() function in the
>>>>>> mca_base_component_t of the form:
>>>>>> int mca_base_query_component_fn(mca_base_module_t **module,
>>>>>>                                 int *priority);
>>>>>> 
>>>>>> This function is specified after the open and close component
>>>>>> functions in this structure so as to allow compatibility with
>>>>>> frameworks that do not use the base selection logic. Frameworks that
>>>>>> do *not* use this function are *not* affected by this commit.
>>>>>> However, every component in the frameworks that use the
>>>>>> mca_base_select function must adjust its component query function to
>>>>>> fit the form specified above.
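
As a rough illustration of what a component query function of that shape might look like, here is a sketch with stand-in types. The `sketch_module_t` type, the function names, the priority values, and the 0 / -1 return codes are all assumptions made for the example; the real mca_base_module_t and return constants come from the Open MPI headers.

```c
#include <stddef.h>

/* Stand-in for illustration only; not the real mca_base_module_t. */
typedef struct sketch_module {
    const char *label;
} sketch_module_t;

/* Same shape as the query prototype quoted above, using the stand-in type. */
typedef int (*sketch_query_fn_t)(sketch_module_t **module, int *priority);

static sketch_module_t example_module = { "example" };

/* A component that is usable: it hands back its module and a priority. */
static int example_component_query(sketch_module_t **module, int *priority)
{
    *priority = 10;                 /* arbitrary value for the sketch */
    *module = &example_module;
    return 0;                       /* assumed success code */
}

/* A component that cannot run here: it declines by returning an error
 * and leaving the module pointer NULL. */
static int unavailable_component_query(sketch_module_t **module, int *priority)
{
    *priority = -1;
    *module = NULL;
    return -1;                      /* assumed error code */
}
```

Because every component exposes the same two outputs (a module pointer and a priority), a single shared selection routine can compare components from any framework without knowing anything framework-specific.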
>>>>>> 
>>>>>> 18 frameworks in Open MPI have been changed. I have updated all of
>>>>>> the components in the 18 frameworks available in the trunk on my
>>>>>> branch. The affected frameworks are:
>>>>>> - OPAL carto
>>>>>> - OPAL crs
>>>>>> - OPAL maffinity
>>>>>> - OPAL memchecker
>>>>>> - OPAL paffinity
>>>>>> - ORTE errmgr
>>>>>> - ORTE ess
>>>>>> - ORTE filem
>>>>>> - ORTE grpcomm
>>>>>> - ORTE odls
>>>>>> - ORTE plm
>>>>>> - ORTE ras
>>>>>> - ORTE rmaps
>>>>>> - ORTE routed
>>>>>> - ORTE snapc
>>>>>> - OMPI crcp
>>>>>> - OMPI dpm
>>>>>> - OMPI pubsub
>>>>>> 
>>>>>> There was a question of the memory footprint change as a result of
>>>>>> this commit. I used 'pmap' to determine the process memory footprint
>>>>>> of a hello world MPI program. Static and shared build numbers are
>>>>>> below, along with variations on launching locally and to a single
>>>>>> node allocated by SLURM. All of this was on Indiana University's Odin
>>>>>> machine. We compare against the trunk (r18276), representing the last
>>>>>> SVN sync point of the branch.
>>>>>> 
>>>>>>  Process(shared)| Trunk    | Branch  | Diff (Improvement)
>>>>>>  ---------------+----------+---------+-------
>>>>>>  mpirun (orted) |   39976K |  36828K | 3148K
>>>>>>  hello (0)      |  229288K | 229268K |   20K
>>>>>>  hello (1)      |  229288K | 229268K |   20K
>>>>>>  ---------------+----------+---------+-------
>>>>>>  mpirun         |   40032K |  37924K | 2108K
>>>>>>  orted          |   34720K |  34660K |   60K
>>>>>>  hello (0)      |  228404K | 228384K |   20K
>>>>>>  hello (1)      |  228404K | 228384K |   20K
>>>>>> 
>>>>>>  Process(static)| Trunk    | Branch  | Diff (Improvement)
>>>>>>  ---------------+----------+---------+-------
>>>>>>  mpirun (orted) |   21384K |  21372K |  12K
>>>>>>  hello (0)      |  194000K | 193980K |  20K
>>>>>>  hello (1)      |  194000K | 193980K |  20K
>>>>>>  ---------------+----------+---------+-------
>>>>>>  mpirun         |   21384K |  21372K |  12K
>>>>>>  orted          |   21208K |  21196K |  12K
>>>>>>  hello (0)      |  193116K | 193096K |  20K
>>>>>>  hello (1)      |  193116K | 193096K |  20K
>>>>>> 
>>>>>> As you can see there are some small memory footprint improvements
>>>>>> on my branch that result from this work. The size of the Open MPI
>>>>>> project shrinks a bit as well. This commit cuts between 2,000 and
>>>>>> 3,500 lines of code (depending on how you count), so about a 1% code
>>>>>> shrink.
>>>>>> 
>>>>>> The branch is stable in all of the testing I have done, but there
>>>>>> are some platforms on which I cannot test. So please give this
>>>>>> branch a try and let me know if you find any problems.
>>>>>> 
>>>>>> Cheers,
>>>>>> Josh
>>>>>> 
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Jeff Squyres
>>>>> Cisco Systems
>>>>> 

