BTW, if you want to look at the code and changelog, it's here:

    http://www.open-mpi.org/hg/hgwebdir.cgi/rhc/channel/

FWIW, the short answer to many of your questions is: orte_show_help() and orte_output() are almost identical to their OPAL counterparts. They have a few extensions to provide simple and desirable semantics that make the output scheme cohesive across a whole job (vs. a single OPAL process). They also allow better integration with third-party tools by separating user MPI process output from OMPI infrastructure output.

FWIW: the "eliminate duplicate help messages" functionality is a side effect of these goals; I seized upon the opportunity to implement it with the new infrastructure because I've wanted this feature for a long, long, looooong time. :-)

Here's a brief comparison of the two:

- opal_output() and opal_show_help() write to their process's stdout and/or stderr streams (indistinguishable from other stdout/stderr data). stdout/stderr from MPI processes is picked up by the orted's IOF, sent via RML to the HNP, and then output as part of the HNP's stdout/stderr streams.

- orte_output() and orte_show_help() render the final message string and then send it via RML to the HNP. The HNP may apply some job-level intelligence (e.g., adding XML tagging based on who the message came from, eliminating duplicate show_help messages, etc.) before outputting it somewhere, such as to the HNP's stdout and/or stderr.

More details below.


On May 9, 2008, at 6:37 PM, George Bosilca wrote:

1) If I understood correctly, we need orte_output and orte_show_help to be able to distinguish between the application's stdout/stderr and the MPI library's? Who applies the filter -- the local daemon or the HNP? How do we make sure that the remote outputs are not interleaved?

orte_show_help() and orte_output() messages are sent via RML to the HNP (which is almost exactly what the IOF does). The HNP listens for these RML messages and then displays them on the HNP stdout/stderr, conditionally applying component-based filtering (e.g., sending the output into a tagged XML stream, or whatever the filter component does), and in the case of show_help messages, checking to see whether the message has been displayed already or not.

orte_show_help() messages are always sent to the HNP stderr (just like opal_show_help messages are always sent to stderr).

orte_output() messages are sent to the HNP's stdout and/or stderr, depending on what was requested when the stream was opened (via orte_output_open(), quite similar to opal_output_open()).

<sidenote>

We had a long discussion about using the IOF for all of this stuff (IOF has tags that could have been used). We ended up creating new RML sends instead for the following reasons:

- The idea of the IOF is great, but the implementation is lacking in many ways. We have several open tickets on the IOF, each of which will require a *lot* of work to fix. We did not want to undertake that right now; there are many complex issues involved in fixing the IOF.

- The argument that swayed me was that for orte_output() and orte_show_help(), the IOF adds little functionality over vanilla RML. With either the RML or the IOF, you'd end up adding a callback handler to be invoked when an output() or show_help() message arrives at the HNP. So why include all the (complex and known-buggy) IOF infrastructure? Note: the only functionality we give up by using the RML instead of the IOF -- multiplexing output to multiple destinations -- is not implemented in the IOF anyway (yet?). Perhaps someday the IOF will be fixed and it will be worthwhile, but probably not any time soon.

</sidenote>

2) Who is really generating the error message ?

Excellent question: keep in mind the difference between *generating* (or rendering) the error message and *outputting* the error message.

Whoever calls orte_output() or orte_show_help() renders the message (usually MPI processes, but even non-MPI-process entities such as orterun can invoke orte_show_help). The message is then sent to the HNP to be output.

In your item #2, I wonder how you distinguish between what needs to be printed once (such as the PML initialization error) and what is supposed to be printed multiple times (such as a BTL TCP connection failure)?

Only orte_show_help() messages are checked for duplicates. orte_output() messages are always output -- just like opal_output().

Note that show_help() messages are already uniquely identified by (filename, topic) tuples. So the MPI process renders the help message into a single string, and then RML sends this string along with the (filename, topic) tuple to the HNP. The HNP examines the (filename, topic) tuple to determine if a similar message has already been printed (remember that we have printf-like expansion in the rendering of show_help messages, so it's unlikely that two messages of the same (filename, topic) tuple will be *exactly* the same, because we frequently include the hostname or other process-specific data in the message). If a similar message has already been printed, the count for that (filename, topic) tuple is simply incremented and an aggregate "Hey, I got N duplicates" notice is printed every ~5 seconds (libevent is your friend) or when the HNP terminates -- whichever occurs first.

If the HNP is managing these error messages, this will force us to always install all the help files; otherwise this approach cannot work in a heterogeneous environment (e.g., the local installation doesn't have InfiniBand support but a remote one does).

The specific error message is rendered in the MPI process not only for the reason you cite, but also because it's the only one that has all the relevant data (potentially to include the helpfile). We could have bundled up all the printf-varargs and sent them to the HNP, but it didn't seem worth it. It would probably be a bandwidth savings, but a) who cares about optimizing error messages?, and b) that then falls into the problem you noted (need all the helpfiles on the HNP). Mebbe someday.

Regardless, the (filename, topic) tuple is still unique and doesn't require the actual file to reside in the HNP-visible filesystem. From the HNP's perspective, it's just a pair of text tags.

3) What is the OMPI layer supposed to use ? opal_output ? orte_output ? or maybe ompi_output ?

opal_output / opal_show_help is for serial processes, just like the rest of the OPAL layer.

We created these new ORTE functions to have job-specific intelligence.

Ralph and I talked about creating OMPI variants, but we didn't know what kind of additional MPI-specific intelligence would be useful. Got any suggestions?

--
Jeff Squyres
Cisco Systems
