This is a paper from back in 2004, and some of the details have definitely 
changed since then, but the high-level concepts should still give you a good 
overview of how Open MPI's coll selection process works:

    http://www.open-mpi.org/papers/ics-2004/



> On Oct 6, 2015, at 9:06 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Dahai Guo,
> 
> On 10/7/2015 3:08 AM, Dahai Guo wrote:
>> Thanks, Jeff. It is very helpful. Some more questions :-):
>> 
>> 1.  There are many coll components, such as basic, tuned, self, cuda, sm, 
>> etc.  Are they all selected at MPI_Init time?  Or are only those satisfying 
>> certain criteria (hardware, communicator size) selected? Or only some 
>> specific ones?
> Some components simply disqualify themselves at MPI_Init time,
> and some other components are not selected when a communicator is created
> (for example, coll/sm cannot be used on a communicator with tasks on several 
> nodes;
> note that coll_sm_priority is zero by default, so coll/sm is disqualified at 
> MPI_Init time, and hence
> this is not a perfect example).
> 
> ompi/mpi/c/barrier.c uses a function pointer to the barrier subroutine,
> and this function pointer was set when the communicator was created.
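
To make that concrete, here is a minimal sketch of the dispatch pattern being 
described: the algorithm is chosen once, when the communicator is created, and 
the MPI_Barrier entry point just calls through a stored function pointer. All 
struct and function names below are invented for illustration; they are not the 
actual Open MPI internals (in the real code the pointer lives in the 
communicator's coll structure and is filled in during the selection done by 
mca_coll_base_comm_select()).

    /* Illustrative sketch only: invented names, not Open MPI internals. */

    struct my_comm;                              /* stand-in for a communicator */
    typedef int (*barrier_fn_t)(struct my_comm *comm);

    struct my_comm {
        int          size;
        barrier_fn_t barrier;                    /* set once, at comm creation  */
    };

    /* two algorithms a component might offer */
    static int barrier_linear(struct my_comm *comm)    { (void)comm; return 0; }
    static int barrier_recursive(struct my_comm *comm) { (void)comm; return 0; }

    /* "communicator creation": select the function pointer up front */
    void comm_select_coll(struct my_comm *comm)
    {
        comm->barrier = (comm->size <= 2) ? barrier_linear : barrier_recursive;
    }

    /* the MPI_Barrier wrapper contains no algorithm choice of its own:
     * it only calls through the pointer, much like ompi/mpi/c/barrier.c */
    int my_MPI_Barrier(struct my_comm *comm)
    {
        return comm->barrier(comm);
    }

The real selection is more involved (priorities, multiple components), but the 
shape is the same: choose at communicator creation, dispatch through a pointer 
at call time.
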
>>   
>> 2.  MPI_Barrier seems to choose the exact algorithm in MPI_Init, 
>> since I checked the file ompi/mpi/c/barrier.c, and there is no choice there 
>> except the inter/intra check. Would you please point out in which code it is 
>> selected, so that I can get some hints for how other MPI collective functions 
>> are selected?
>> 
> Most of the time, coll/tuned is used,
> and most of the time coll_tuned_XXX_intra_dec_fixed is used;
> this function chooses the collective algorithm to be used.
> 
> For example, MPI_Barrier invokes ompi_coll_tuned_barrier_intra_dec_fixed, 
> which will choose and invoke one of:
> - ompi_coll_base_barrier_intra_two_procs
> - ompi_coll_base_barrier_intra_bruck
> - ompi_coll_base_barrier_intra_recursivedoubling
> 
> The best way to understand this part is probably to use a debugger: set a 
> breakpoint in MPI_Barrier and step as long as required.
> In most cases, you will end up using the coll/tuned module on an 
> intra-communicator, so unless you plan to develop your own collective module, 
> you can skip the module initialization/selection part.
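
For reference, here is a rough sketch of the shape of the fixed decision 
function mentioned above. The branching is made up for illustration (the real 
cutoffs are in ompi/mca/coll/tuned/coll_tuned_decision_fixed.c), and the three 
helpers are stand-ins that simply call MPI_Barrier, whereas the real code calls 
the ompi_coll_base_barrier_intra_* implementations listed above.

    /* Sketch only: illustrative branching, not the real decision table. */
    #include <mpi.h>

    /* stand-ins for the ompi_coll_base_barrier_intra_* implementations */
    static int barrier_two_procs(MPI_Comm c)         { return MPI_Barrier(c); }
    static int barrier_bruck(MPI_Comm c)             { return MPI_Barrier(c); }
    static int barrier_recursivedoubling(MPI_Comm c) { return MPI_Barrier(c); }

    int barrier_intra_dec_fixed_sketch(MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);

        if (2 == size) {
            /* exactly two ranks: use the dedicated two-process barrier */
            return barrier_two_procs(comm);
        } else if (0 != (size & (size - 1))) {
            /* not a power of two (illustrative split between the other two) */
            return barrier_recursivedoubling(comm);
        }
        /* power-of-two communicator */
        return barrier_bruck(comm);
    }
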
>> 3. I saw somewhere run-time parameters that choose algorithms, such as 
>> "--mca coll_tuned_reduce_algorithm 5". Where can I find the complete list of 
>> these kinds of runtime options and their value choices?
>> 
> You can run ompi_info --all and search for coll_tuned_xxx_algorithm.
> For example:
> MCA coll: parameter "coll_tuned_barrier_algorithm" (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
>     Which barrier algorithm is used. Can be locked down to choice of: 0 ignore, 1 linear, 2 double ring, 3: recursive doubling 4: bruck, 5: two proc only, 6: tree
>     Valid values: 0:"ignore", 1:"linear", 2:"double_ring", 3:"recursive_doubling", 4:"bruck", 5:"two_proc", 6:"tree"
> 
> If you want to force the usage of the bruck (4) algorithm, you can run:
> mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_barrier_algorithm 4 ...
> 
> Cheers,
> 
> Gilles
> 
>> Dahai 
>> 
>> 
>> 
>> On Tuesday, October 6, 2015 12:25 PM, Jeff Squyres (jsquyres) 
>> <jsquy...@cisco.com> wrote:
>> 
>> 
>> On Oct 6, 2015, at 10:19 AM, Dahai Guo <dahaiguo2...@yahoo.com> wrote:
>> > 
>> > Thanks, Gilles. Some more questions:
>> > 
>> > 1. How does Open MPI define the priorities of the different collective 
>> > components? What criteria are they based on?
>> 
>> The priorities are in the range of [0, 100] (100=highest).  The priorities 
>> tend to be fairly coarse-grained; they're mainly based on relative knowledge 
>> of how good / bad a particular algorithm is going to be.
>> 
>> > 2. How does an MPI collective function (MPI_Barrier, for example) choose the 
>> > exact algorithm it uses? Based on message size and communicator size? Any 
>> > other factors? 
>> 
>> Yes (all of the above).  Meaning: each component is responsible for a) 
>> determining whether it will provide a function pointer for each operation, 
>> and b) what that function pointer's priority should be (same disclaimer as 
>> my last mail: I don't remember offhand if there's a single priority for the 
>> whole component, or on a per-function-pointer/operation basis).
>> 
>> Hence, the component can use whatever criteria it wants to determine if it 
>> wants to provide a function pointer or not.  E.g., if it only has algorithms 
>> that work with communicators that have a size that is a power of 2, then it 
>> can use that information to determine whether it wants to provide a function 
>> pointer for a new communicator or not.
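
A toy version of that power-of-two example might look like the sketch below. 
The names (fake_comm, pow2_comm_query, etc.) and the priority value are 
invented; the real hook is the coll component's comm-query callback, which is 
invoked from mca_coll_base_comm_select() at communicator-creation time.

    /* Toy illustration: invented types and names, not the mca/coll API. */
    #include <stdlib.h>

    struct fake_comm   { int size; };
    struct fake_module { int (*barrier)(struct fake_comm *); };

    static int pow2_barrier(struct fake_comm *comm) { (void)comm; return 0; }

    /* A component that only handles power-of-two communicators: it offers a
     * module plus a priority for those, and declines (returns NULL) otherwise. */
    struct fake_module *pow2_comm_query(struct fake_comm *comm, int *priority)
    {
        int size = comm->size;
        if (size < 2 || 0 != (size & (size - 1))) {
            return NULL;                      /* disqualify ourselves */
        }
        *priority = 40;                       /* coarse-grained value in [0, 100] */
        struct fake_module *m = malloc(sizeof(*m));
        if (NULL != m) {
            m->barrier = pow2_barrier;
        }
        return m;
    }
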
>> 
>> > 3. When does MPI_Barrier choose the algorithm?  In ompi_mpi_init? Or 
>> > every time the API program calls MPI_Barrier? 
>> 
>> A combination of: when the communicator is constructed and when the barrier 
>> is run.
>> 
>> I already described the communicator-constructor scenario.  But in addition 
>> to that, it's certainly possible to have a collective operation dispatch to 
>> a function that makes a further run-time based decision (the tuned 
>> collective component does a lot of this).
>> 
>> For barrier, that wouldn't really be necessary: you can set up everything at 
>> communicator constructor time, because the MPI_BARRIER API 
>> doesn't have any variation in its parameters -- i.e., you know everything at 
>> communicator constructor time.  But for other operations, you might choose 
>> different algorithms depending on the number of local peers, the size of the 
>> message, ...etc.  Hence, you might want to make the final algorithm dispatch 
>> decision when MPI_GATHER is invoked with the final set of parameters, etc.
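
As a sketch of that "decide when the operation is invoked" idea, consider a 
broadcast-like decision that branches on message size and communicator size at 
call time. The 8 KB cutoff, the function names, and the two helpers (which just 
call MPI_Bcast here) are all invented for illustration; the real per-call logic 
for coll/tuned lives in coll_tuned_decision_fixed.c.

    /* Sketch only: invented cutoff and helper names. */
    #include <stddef.h>
    #include <mpi.h>

    /* stand-ins for two broadcast algorithms; here they just call MPI_Bcast */
    static int bcast_linear(void *b, int n, MPI_Datatype t, int root, MPI_Comm c)
    { return MPI_Bcast(b, n, t, root, c); }
    static int bcast_binomial(void *b, int n, MPI_Datatype t, int root, MPI_Comm c)
    { return MPI_Bcast(b, n, t, root, c); }

    int bcast_dec_sketch(void *buf, int count, MPI_Datatype dtype,
                         int root, MPI_Comm comm)
    {
        int size, type_size;
        MPI_Comm_size(comm, &size);
        MPI_Type_size(dtype, &type_size);
        size_t msg_bytes = (size_t)count * (size_t)type_size;

        /* small communicator or small message: keep it simple */
        if (size <= 4 || msg_bytes < 8192) {
            return bcast_linear(buf, count, dtype, root, comm);
        }
        /* otherwise prefer a tree-based algorithm */
        return bcast_binomial(buf, count, dtype, root, comm);
    }
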
>> 
>> > 4. Do all the MPI collective functions follow the same procedure to choose 
>> > algorithms in the API program?
>> 
>> I'm not sure how to parse this question.
>> 
>> In general, all MPI collective operations follow the same procedure to 
>> select which component is selected at communicator constructor time.  When 
>> the collective operation is dispatched off to the module at run time (e.g., 
>> when MPI_BCAST is invoked), it's then up to the module to decide what to do 
>> next (i.e., how to actually effect that collective operation).
>> 
>> > It would be great if you can point out some main OMPI files and functions 
>> > that are involved in the process.
>> 
>> You might want to step through the selection process with a debugger to see 
>> what happens.  Set a breakpoint on mca_coll_base_comm_select() and step 
>> through from there.
>> 
>> 
>> > Dahai
>> > 
>> > 
>> > 
>> > On Tuesday, October 6, 2015 1:08 AM, Gilles Gouaillardet 
>> > <gilles.gouaillar...@gmail.com> wrote:
>> > 
>> > 
>> > First, you can check the priorities of the various coll modules
>> > with ompi_info:
>> > 
>> > $ ompi_info --all | grep \"coll_ | grep priority
>> >     MCA coll: parameter "coll_basic_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
>> >     MCA coll: parameter "coll_inter_priority" (current value: "40", data source: default, level: 9 dev/all, type: int)
>> >     MCA coll: parameter "coll_libnbc_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
>> >     MCA coll: parameter "coll_ml_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
>> >     MCA coll: parameter "coll_self_priority" (current value: "75", data source: default, level: 9 dev/all, type: int)
>> >     MCA coll: parameter "coll_sm_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
>> >     MCA coll: parameter "coll_tuned_priority" (current value: "30", data source: default, level: 6 tuner/all, type: int)
>> > 
>> > 
>> > Given these priorities, coll/tuned is likely the collective module you will be using.
>> > Then you can check the various ompi_coll_tuned_*_intra_dec_fixed functions in
>> > ompi/mca/coll/tuned/coll_tuned_decision_fixed.c;
>> > this is how the tuned collective module selects algorithms based on
>> > communicator size and message size.
>> > 
>> > Cheers,
>> > 
>> > Gilles
>> > 
>> > On Sun, Oct 4, 2015 at 11:12 AM, Dahai Guo <dahaiguo2...@yahoo.com> wrote:
>> > > Thanks, Jeff. I am trying to understand in detail how Open MPI works at 
>> > > run time. What main functions does it call to select and initialize the 
>> > > coll
>> > > components? Using "helloworld" as an example,  how does it select and
>> > > initialize the MPI_Barrier algorithm?  Which C functions are involved and
>> > > used in the process?
>> > >
>> > > Dahai
>> > >
>> > >
>> > >
>> > > On Friday, October 2, 2015 7:50 PM, Jeff Squyres (jsquyres)
>> > > <jsquy...@cisco.com> wrote:
>> > >
>> > >
>> > > On Oct 2, 2015, at 2:21 PM, Dahai Guo <dahaiguo2...@yahoo.com> wrote:
>> > >>
>> > >> Is there any way to trace Open MPI internal function calls in an MPI user
>> > >> program?
>> > >
>> > > Unfortunately, not easily -- other than using a debugger, for example.
>> > >
>> > >> If so, can anyone explain it with an example, such as helloworld?  I
>> > >> built Open MPI with the VampirTrace options and compiled the following
>> > >> program with mpicc-vt, but I didn't get any tracing info.
>> > >
>> > > Open MPI is a giant state machine -- MPI_INIT, for example, invokes 
>> > > slightly
>> > > fewer than a bazillion functions (e.g., it initializes every framework 
>> > > and
>> > > many components/plugins).
>> > >
>> > > Is there something in particular that you're looking for / want to know
>> > > about?
>> > >
>> > >> Thanks
>> > >>
>> > >> D. G.
>> > >>
>> > >> #include <stdio.h>
>> > >> #include <mpi.h>
>> > >>
>> > >>
>> > >> int main (int argc, char **argv)
>> > >> {
>> > >>  int rank, size;
>> > >>
>> > >>  MPI_Init (&argc, &argv);
>> > >>  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
>> > >>  MPI_Comm_size (MPI_COMM_WORLD, &size);
>> > >>  printf( "Hello world from process %d of %d\n", rank, size );
>> > >>  MPI_Barrier(MPI_COMM_WORLD);
>> > >>  MPI_Finalize();
>> > >>  return 0;
>> > >> }
>> > >>
>> > >
>> > >
>> > 
>> > 
>> > 
>> > 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
