This is a paper from back in 2004, and some of the details have definitely changed since then, but the high-level concepts likely still give a good overview of how Open MPI's coll selection process works:
http://www.open-mpi.org/papers/ics-2004/

> On Oct 6, 2015, at 9:06 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Dahai Guo,
>
> On 10/7/2015 3:08 AM, Dahai Guo wrote:
>> Thanks, Jeff. It is very helpful. Some more questions :-):
>>
>> 1. There are many coll components, such as basic, tuned, self, cuda, sm,
>> etc. Are they all selected at MPI_Init time? Or only those satisfying
>> some criteria (hardware, communicator size)? Or only some specific ones?
>
> Some components simply disqualify themselves at MPI_Init time,
> and some other components are not selected when a communicator is created.
> (For example, coll/sm cannot be used on a communicator with tasks on several
> nodes. Note coll_sm_priority is zero by default, so coll/sm is disqualified
> at MPI_Init time, and hence this is not a perfect example.)
>
> ompi/mpi/c/barrier.c uses a function pointer to the barrier subroutine,
> and this function pointer was set when the communicator was created.
>
>> 2. MPI_Barrier seems to choose the exact algorithm for the API in MPI_Init,
>> since I checked the file ompi/mpi/c/barrier.c, and there is no choice
>> except the inter/intra judgment. Would you please point out in which code
>> it is selected? Then I can get some hints for the selection of other MPI
>> collective functions.
>
> Most of the time, coll/tuned is used,
> and most of the time, coll_tuned_XXX_intra_dec_fixed is used;
> this function chooses the collective algorithm to be used.
>
> For example, MPI_Barrier invokes ompi_coll_tuned_barrier_intra_dec_fixed,
> which will choose and invoke one of:
> - ompi_coll_base_barrier_intra_two_procs
> - ompi_coll_base_barrier_intra_bruck
> - ompi_coll_base_barrier_intra_recursivedoubling
>
> The best way to understand this part is probably to use a debugger: set a
> breakpoint in MPI_Barrier and step as long as required.
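A minimal C sketch of the kind of fixed decision Gilles describes: pick a barrier algorithm from the communicator size alone. The helper name and the thresholds here are illustrative only; they are not the actual cutoffs used by ompi_coll_tuned_barrier_intra_dec_fixed.

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch of a fixed decision function: choose a barrier
 * algorithm from the communicator size.  The thresholds are made up;
 * the real logic is in ompi_coll_tuned_barrier_intra_dec_fixed. */
static const char *pick_barrier_algorithm(int comm_size)
{
    if (comm_size == 2) {
        /* ompi_coll_base_barrier_intra_two_procs */
        return "two_procs";
    } else if (comm_size > 0 && (comm_size & (comm_size - 1)) == 0) {
        /* power-of-2 sizes suit recursive doubling:
         * ompi_coll_base_barrier_intra_recursivedoubling */
        return "recursivedoubling";
    }
    /* ompi_coll_base_barrier_intra_bruck handles the general case */
    return "bruck";
}
```

Barrier has no message-size parameter, which is why communicator size is the only input such a decision function needs.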
> In most cases, you will end up using the coll/tuned module on an
> intra-communicator, so unless you plan to develop your own collective
> module, you can skip the module initialization/selection part.
>
>> 3. I saw somewhere the run-time parameters to choose algorithms, such as
>> "--mca coll_tuned_reduce_algorithm 5". Where can I find the complete list
>> of these kinds of runtime options and their value choices?
>
> You can run ompi_info --all and search for coll_tuned_xxx_algorithm.
> For example:
>
> MCA coll: parameter "coll_tuned_barrier_algorithm" (current value: "ignore",
>     data source: default, level: 5 tuner/detail, type: int)
>     Which barrier algorithm is used. Can be locked down to choice of:
>     0 ignore, 1 linear, 2 double ring, 3 recursive doubling, 4 bruck,
>     5 two proc only, 6 tree
>     Valid values: 0:"ignore", 1:"linear", 2:"double_ring",
>     3:"recursive_doubling", 4:"bruck", 5:"two_proc", 6:"tree"
>
> If you want to force the usage of the bruck (4) algorithm, you can run:
> mpirun --mca coll_tuned_use_dynamic_rules 1 \
>        --mca coll_tuned_barrier_algorithm 4 ...
>
> Cheers,
>
> Gilles
>
>> Dahai
>>
>> On Tuesday, October 6, 2015 12:25 PM, Jeff Squyres (jsquyres)
>> <jsquy...@cisco.com> wrote:
>>
>> On Oct 6, 2015, at 10:19 AM, Dahai Guo <dahaiguo2...@yahoo.com> wrote:
>> >
>> > Thanks, Gilles. Some more questions:
>> >
>> > 1. How does Open MPI define the priorities of the different collective
>> > components? What criteria is that based on?
>>
>> The priorities are in the range [0, 100] (100 = highest). The priorities
>> tend to be fairly coarse-grained; they're mainly based on relative
>> knowledge of how good / bad a particular algorithm is going to be.
>>
>> > 2. How does an MPI collective function (MPI_Barrier, for example) choose
>> > the exact algorithm it uses? Based on message size and communicator
>> > size? Any other factors?
>>
>> Yes (all of the above).
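For quick reference, the valid-values list from the ompi_info output above can be written down as a small C lookup table. The mapping of numbers to names comes from that output; the helper function itself is made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Map a coll_tuned_barrier_algorithm value to its name, mirroring the
 * "Valid values" list printed by ompi_info.  The table contents come
 * from that output; this helper is purely illustrative. */
static const char *barrier_algorithm_name(int value)
{
    static const char *const names[] = {
        "ignore", "linear", "double_ring", "recursive_doubling",
        "bruck", "two_proc", "tree"
    };
    if (value < 0 || value >= (int)(sizeof(names) / sizeof(names[0]))) {
        return "invalid";
    }
    return names[value];
}
```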
>> Meaning: each component is responsible for a) determining whether it will
>> provide a function pointer for each operation, and b) what that function
>> pointer's priority should be (same disclaimer as my last mail: I don't
>> remember offhand if there's a single priority for the whole component, or
>> one per function pointer/operation).
>>
>> Hence, the component can use whatever criteria it wants to determine
>> whether it will provide a function pointer or not. E.g., if it only has
>> algorithms that work with communicators whose size is a power of 2, then
>> it can use that information to decide whether to provide a function
>> pointer for a new communicator.
>>
>> > 3. When does MPI_Barrier choose the algorithm? In ompi_mpi_init? Or
>> > every time the API program calls MPI_Barrier?
>>
>> A combination of: when the communicator is constructed and when the
>> barrier is run.
>>
>> I already described the communicator-constructor scenario. But in
>> addition to that, it's certainly possible to have a collective operation
>> dispatch to a function that makes a further run-time-based decision (the
>> tuned collective component does a lot of this).
>>
>> For barrier, that wouldn't really be necessary, because you can set up
>> everything at communicator constructor time: the MPI_BARRIER API doesn't
>> have any variation in its parameters, i.e., you know everything at
>> communicator constructor time. But for other operations, you might
>> choose different algorithms depending on the number of local peers, the
>> size of the message, etc. Hence, you might want to make the final
>> algorithm dispatch decision when MPI_GATHER is invoked with the final
>> set of parameters, etc.
>>
>> > 4. Do all the MPI collective functions follow the same procedure to
>> > choose algorithms in the API program?
>>
>> I'm not sure how to parse this question.
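Jeff's power-of-2 example can be sketched as follows. The struct and function names here are hypothetical, invented only to illustrate the query step; the real component interface lives in ompi/mca/coll/coll.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of a component's query step: a component that
 * only implements power-of-2 algorithms declines to offer a barrier
 * function pointer for other communicator sizes.  Names are made up;
 * the real interface is in ompi/mca/coll/coll.h. */
struct coll_offer {
    bool provides_barrier;   /* does this component offer a barrier fn? */
    int  priority;           /* [0, 100]; 100 = highest */
};

static struct coll_offer query_pow2_component(int comm_size)
{
    struct coll_offer offer = { false, 0 };
    if (comm_size > 0 && (comm_size & (comm_size - 1)) == 0) {
        offer.provides_barrier = true;
        offer.priority = 50;  /* illustrative priority */
    }
    return offer;
}
```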
>>
>> In general, all MPI collective operations follow the same procedure to
>> select which component is used at communicator constructor time. When
>> the collective operation is dispatched off to the module at run time
>> (e.g., when MPI_BCAST is invoked), it's then up to the module to decide
>> what to do next (i.e., how to actually effect that collective operation).
>>
>> > It would be great if you can point out some main OMPI files and
>> > functions that are involved in the process.
>>
>> You might want to step through the selection process with a debugger to
>> see what happens. Set a breakpoint on mca_coll_base_comm_select() and
>> step through from there.
>>
>> > Dahai
>> >
>> > On Tuesday, October 6, 2015 1:08 AM, Gilles Gouaillardet
>> > <gilles.gouaillar...@gmail.com> wrote:
>> >
>> > At first, you can check the priorities of the various coll modules
>> > with ompi_info:
>> >
>> > $ ompi_info --all | grep \"coll_ | grep priority
>> > MCA coll: parameter "coll_basic_priority" (current value: "10",
>> >     data source: default, level: 9 dev/all, type: int)
>> > MCA coll: parameter "coll_inter_priority" (current value: "40",
>> >     data source: default, level: 9 dev/all, type: int)
>> > MCA coll: parameter "coll_libnbc_priority" (current value: "10",
>> >     data source: default, level: 9 dev/all, type: int)
>> > MCA coll: parameter "coll_ml_priority" (current value: "0",
>> >     data source: default, level: 9 dev/all, type: int)
>> > MCA coll: parameter "coll_self_priority" (current value: "75",
>> >     data source: default, level: 9 dev/all, type: int)
>> > MCA coll: parameter "coll_sm_priority" (current value: "0",
>> >     data source: default, level: 9 dev/all, type: int)
>> > MCA coll: parameter "coll_tuned_priority" (current value: "30",
>> >     data source: default, level: 6 tuner/all, type: int)
>> >
>> > coll/tuned is likely the collective module you will be using.
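The highest-priority-wins rule at communicator construction can be sketched like this. The candidate struct and selection helper are illustrative; the real logic is in mca_coll_base_comm_select(), the function Jeff suggests as a breakpoint.

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch of component selection: among the components
 * that offered themselves for a communicator, the one with the
 * highest priority wins.  The real code is mca_coll_base_comm_select(). */
struct coll_candidate {
    const char *name;
    int priority;      /* [0, 100]; 100 = highest */
};

static const char *select_coll_component(const struct coll_candidate *cands,
                                         int n)
{
    const char *best = NULL;
    int best_prio = -1;
    for (int i = 0; i < n; i++) {
        if (cands[i].priority > best_prio) {   /* higher priority wins */
            best_prio = cands[i].priority;
            best = cands[i].name;
        }
    }
    return best;
}
```

With the default priorities listed above (basic 10, libnbc 10, tuned 30), an ordinary intra-communicator picks "tuned", matching Gilles's remark that coll/tuned is the module you will usually end up using.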
>> > Then you can check the various ompi_coll_tuned_*_intra_dec_fixed
>> > functions in ompi/mca/coll/tuned/coll_tuned_decision_fixed.c;
>> > this is how the tuned collective module selects algorithms based on
>> > communicator size and message size.
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > On Sun, Oct 4, 2015 at 11:12 AM, Dahai Guo <dahaiguo2...@yahoo.com> wrote:
>> > > Thanks, Jeff. I am trying to understand in detail how Open MPI works
>> > > at run time. What main functions does it call to select and
>> > > initialize the coll components? Using the "helloworld" as an example,
>> > > how does it select and initialize the MPI_Barrier algorithm? Which C
>> > > functions are involved and used in the process?
>> > >
>> > > Dahai
>> > >
>> > > On Friday, October 2, 2015 7:50 PM, Jeff Squyres (jsquyres)
>> > > <jsquy...@cisco.com> wrote:
>> > >
>> > > On Oct 2, 2015, at 2:21 PM, Dahai Guo <dahaiguo2...@yahoo.com> wrote:
>> > >>
>> > >> Is there any way to trace Open MPI internal function calls in an MPI
>> > >> user program?
>> > >
>> > > Unfortunately, not easily -- other than using a debugger, for example.
>> > >
>> > >> If so, can anyone explain it with an example, such as helloworld? I
>> > >> built Open MPI with the VampirTrace options and compiled the
>> > >> following program with mpicc-vt, but I didn't get any tracing info.
>> > >
>> > > Open MPI is a giant state machine -- MPI_INIT, for example, invokes
>> > > slightly fewer than a bazillion functions (e.g., it initializes every
>> > > framework and many components/plugins).
>> > >
>> > > Is there something in particular that you're looking for / want to
>> > > know about?
>> > >
>> > >> Thanks
>> > >>
>> > >> D. G.
>> > >>
>> > >> #include <stdio.h>
>> > >> #include <mpi.h>
>> > >>
>> > >> int main (int argc, char **argv)
>> > >> {
>> > >>     int rank, size;
>> > >>
>> > >>     MPI_Init (&argc, &argv);
>> > >>     MPI_Comm_rank (MPI_COMM_WORLD, &rank);
>> > >>     MPI_Comm_size (MPI_COMM_WORLD, &size);
>> > >>     printf( "Hello world from process %d of %d\n", rank, size );
>> > >>     MPI_Barrier(MPI_COMM_WORLD);
>> > >>     MPI_Finalize();
>> > >>     return 0;
>> > >> }
>> > >>
>> > >> _______________________________________________
>> > >> devel mailing list
>> > >> de...@open-mpi.org
>> > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > >> Link to this post:
>> > >> http://www.open-mpi.org/community/lists/devel/2015/10/18125.php
>> > >
>> > > --
>> > > Jeff Squyres
>> > > jsquy...@cisco.com
>> > > For corporate legal information go to:
>> > > http://www.cisco.com/web/about/doing_business/legal/cri/

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/