Re: [OMPI devel] initial SCTP BTL commit comments?

Jeff Squyres Tue, 13 Nov 2007 10:00:01 -0500

I have no objections to bringing this into the trunk, but I agree that an .ompi_ignore is probably a good idea at first.

One question that I'd like to have answered is how OMPI decides whether to use the SCTP BTL or not. If there are SCTP stacks available by default in Linux and OS X -- but their performance may be sub-optimal and/or buggy, we may want to have the SCTP BTL only activated if the user explicitly asks for it. Open MPI is very concerned with "out of the box" behavior -- we need to ensure that "mpirun a.out" will "just work" on all of our supported platforms.


Will UBC set up regular MTT runs to test the SCTP stuff?  :-)

More below.


On Nov 10, 2007, at 9:25 PM, Brad Penoff wrote:

Currently, both the one-to-one and the one-to-many make use of the
event library offered by Open MPI.  The callback functions for the
one-to-many style however are quite unique as multiple endpoints may
be interested in the events that poll returns. Currently we use these unique callback functions, but in the future the hope is to play with
the potential benefits of a btl_progress function, particularly for
the one-to-many style.
In my experience the event callbacks have a high overhead compared to a
progress function, so I'd say that's definitely worth checking out.


We noticed that poll is only called after a timer goes off while
btl_progress would be called with each iteration of opal_progress, so
noticing that along with your encouragement makes us want to check it
out even more.

Be aware that based on discussions from the Paris meeting, some changes to libevent are coming (I really need to get this on a wiki page or something). Here's a quick summary:

- We're waiting for a new release of libevent (or libev -- we'll see how it shakes out) that has lots of bug fixes and performance improvements as compared to the version we currently have in the OMPI tree. Based on some libevent mailing list traffic, this release may be in Dec 2007. We'll see what happens.

- After we update libevent, we'll be making a policy change w.r.t. OMPI progress functions and timer callbacks: only software layers with actual devices will be allowed to register progress functions (in particular, the io and osd framework progress functions will be eliminated; see below). All other progress-requiring functions will have to use timers. This means that every time we call progress, we *only* call the stuff that needs to be polled as frequently as possible. We'll call the less-important progress stuff less frequently (e.g., ORTE OOB/RML).

- We'll be changing our use of libevent to utilize the more scalable polling capabilities (such as epoll and friends). We don't use them right now because on all OS's that we currently care about (Linux, OS X, Solaris), mixing the scalable fd polling mechanism with pty's results in Very Very Bad Things. We'll special case where pty's are used and only use select/poll there, and then use epoll (etc.) elsewhere.

- We'll also be changing our use of libevent to utilize timers properly.

- ompi_request_t will be augmented to have a callback that, if non-NULL, will be invoked when the request is completed. This will allow removing the io and osd framework progress functions.

- We may also add a high-performance clock framework in Open MPI -- a way of accessing high-resolution timers and clocks on the host (e.g., on Intel chips, additional algorithms are necessary to normalize the per-chip clocks between sockets, especially if a process bounces between sockets -- unnecessary on AMD, PPC, and SPARC platforms). This could improve performance and precision of the libevent timers.

- Finally, registering progress functions will take a new parameter: a file descriptor. If a file descriptor is provided and opal_progress() decides that it wants to block (specific mechanism TBD, but probably something similar to what other hybrid polling/blocking systems do: poll for a while, and if nothing "interesting" happens, block) *and* if all registered progress functions have valid fd's, then we'll block until either a timer expires or something "interesting" happens.
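
To make the last item a bit more concrete, here is a rough sketch of the "poll for a while, then block" idea in plain C. Everything in it is hypothetical: sketch_progress_register(), sketch_progress(), SPIN_LIMIT, and the demo_progress() helper are illustrative names only, not the Open MPI API, and a simple timeout argument stands in for "the next timer expiry". The real mechanism is still TBD.

```c
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_PROGRESS 8      /* hypothetical table size */
#define SPIN_LIMIT   1000   /* hypothetical: idle passes before blocking */

typedef int (*progress_fn_t)(void);  /* returns number of events handled */

struct progress_entry {
    progress_fn_t fn;
    int fd;                 /* -1 means "no fd, so we may never block" */
};

static struct progress_entry entries[MAX_PROGRESS];
static size_t num_entries = 0;

/* Register a progress function together with an (optional) fd. */
int sketch_progress_register(progress_fn_t fn, int fd)
{
    if (num_entries == MAX_PROGRESS) return -1;
    entries[num_entries].fn = fn;
    entries[num_entries].fd = fd;
    num_entries++;
    return 0;
}

/* Call every registered progress function once. */
static int poll_once(void)
{
    int events = 0;
    for (size_t i = 0; i < num_entries; i++) events += entries[i].fn();
    return events;
}

/* Blocking is only allowed if every registrant supplied a valid fd. */
static int all_have_fds(void)
{
    for (size_t i = 0; i < num_entries; i++)
        if (entries[i].fd < 0) return 0;
    return num_entries > 0;
}

/* Spin for a while; if nothing "interesting" happens and all fds are
 * valid, block in poll() until an fd is readable or timeout_ms passes. */
int sketch_progress(int timeout_ms)
{
    struct pollfd pfds[MAX_PROGRESS];

    for (int i = 0; i < SPIN_LIMIT; i++) {
        int n = poll_once();
        if (n > 0) return n;
    }
    if (!all_have_fds()) return 0;

    for (size_t i = 0; i < num_entries; i++) {
        pfds[i].fd = entries[i].fd;
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    if (poll(pfds, (nfds_t)num_entries, timeout_ms) <= 0) return 0;
    return poll_once();     /* something woke us up: progress it */
}

/* Example "device": consumes one byte from demo_fd if one is waiting. */
static int demo_fd = -1;
static int demo_progress(void)
{
    struct pollfd p = { .fd = demo_fd, .events = POLLIN, .revents = 0 };
    char byte;
    if (poll(&p, 1, 0) > 0 && (p.revents & POLLIN))
        return read(demo_fd, &byte, 1) == 1 ? 1 : 0;
    return 0;
}
```

A caller would set demo_fd to the read end of a pipe (or a socket), register demo_progress with that same fd, and then call sketch_progress() in its main loop. If any registrant passes -1 as its fd, the sketch never blocks and simply returns after spinning, which matches the "all registered progress functions have valid fd's" condition above.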


--
Jeff Squyres
Cisco Systems
