BTW, if you want to look at the code and changelog, it's here:

    http://www.open-mpi.org/hg/hgwebdir.cgi/rhc/channel/

FWIW, the short answer to many of your questions is: orte_show_help() and orte_output() are almost identical to their OPAL counterparts. They have a few extensions to provide simple and desirable semantics that make the output scheme cohesive across a whole job (vs. a single OPAL process). They also allow better integration with third-party tools by separating user MPI process output from OMPI infrastructure output.

FWIW: the "eliminate duplicate help messages" functionality is a side effect of these goals; I seized upon the opportunity to implement it with the new infrastructure because I've wanted this feature for a long, long, looooong time. :-)

Here's a brief comparison of the two:

- opal_output() and opal_show_help() write to their process's stdout and/or stderr streams (indistinguishable from other stdout/stderr data). stdout/stderr from MPI processes is picked up by the orted's IOF, sent via RML to the HNP, and then output as part of the HNP's stdout/stderr streams.

- orte_output() and orte_show_help() render the final message string and then send it via RML to the HNP. The HNP may apply some job-level intelligence (e.g., adding XML tagging based on who the message came from, eliminating duplicate show_help messages, etc.) before outputting it somewhere, such as to the HNP's stdout and/or stderr.

More details below.


On May 9, 2008, at 6:37 PM, George Bosilca wrote:

1) If I understood correctly, we need orte_output and orte_show_help to be able to distinguish between the application's stdout/stderr and the MPI library's? Who applies the filter -- the local daemon or the HNP? How do we make sure that the remote outputs are not interleaved?

orte_show_help() and orte_output() messages are sent via RML to the HNP (which is almost exactly what the IOF does). The HNP listens for these RML messages and then displays them on the HNP stdout/stderr, conditionally applying component-based filtering (e.g., sending the output into a tagged XML stream, or whatever the filter component does), and in the case of show_help messages, checking to see whether the message has been displayed already or not.

orte_show_help() messages are always sent to the HNP stderr (just like opal_show_help messages are always sent to stderr).

orte_output() messages are sent to the HNP's stdout and/or stderr, depending on what was requested when the stream was opened (via orte_output_open(), quite similar to opal_output_open()).

<sidenote>

We had a long discussion about using the IOF for all of this stuff (IOF has tags that could have been used). We ended up creating new RML sends instead for the following reasons:

- The idea of the IOF is great, but the implementation is lacking in many ways. We have several open tickets on the IOF, each of which will require a *lot* of work to fix. We did not want to undertake that right now; there are many complex issues involved in fixing the IOF.

- The argument that swayed me was that for orte_output() and orte_show_help(), the IOF adds little functionality over vanilla RML. With either the RML or the IOF, you'd end up adding a callback handler to be invoked when an output() or show_help() message arrives at the HNP. So why include all the (complex and known-buggy) IOF infrastructure? Note: the only functionality we give up by using the RML instead of the IOF -- multiplexing output to multiple destinations -- is not implemented in the IOF anyway (yet?). Perhaps someday the IOF will be fixed and it will be worthwhile, but probably not any time soon.

</sidenote>

2) Who is really generating the error message ?

Excellent question: keep in mind the difference between *generating* (or rendering) the error message and *outputting* the error message.

Whoever calls orte_output() or orte_show_help() renders the message (usually MPI processes, but even non-MPI-process entities such as orterun can invoke orte_show_help). The message is then sent to the HNP to be output.

In your item #2, I wonder how you distinguish between what needs to be printed once (such as the PML initialization error) and what is supposed to be printed multiple times (such as a BTL TCP connection failure)?

Only orte_show_help() messages are checked for duplicates. orte_output() messages are always output -- just like opal_output().

Note that show_help() messages are already uniquely identified by (filename, topic) tuples. So the MPI process renders the help message into a single string, and then RML sends this string along with the (filename, topic) tuple to the HNP. The HNP examines the (filename, topic) tuple to determine if a similar message has already been printed (remember that we have printf-like expansion in the rendering of show_help messages, so it's unlikely that two messages of the same (filename, topic) tuple will be *exactly* the same, because we frequently include the hostname or other process-specific data in the message). If a similar message has already been printed, the count for that (filename, topic) tuple is simply incremented and an aggregate "Hey, I got N duplicates" notice is printed every ~5 seconds (libevent is your friend) or when the HNP terminates -- whichever occurs first.

If the HNP is managing these error messages, this will force us to always install all the help files; otherwise this approach cannot work in a heterogeneous environment (e.g., the local installation doesn't have InfiniBand support but a remote one does).

The specific error message is rendered in the MPI process not only for the reason you cite, but also because it's the only one that has all the relevant data (potentially to include the helpfile). We could have bundled up all the printf-varargs and sent them to the HNP, but it didn't seem worth it. It would probably be a bandwidth savings, but a) who cares about optimizing error messages?, and b) that then falls into the problem you noted (need all the helpfiles on the HNP). Mebbe someday.

Regardless, the (filename, topic) tuple is still unique and doesn't require the actual file to reside in the HNP-visible filesystem. From the HNP's perspective, it's just a pair of text tags.

3) What is the OMPI layer supposed to use ? opal_output ? orte_output ? or maybe ompi_output ?

opal_output / opal_show_help is for serial processes, just like the rest of the OPAL layer.

We created these new ORTE functions to have job-specific intelligence.

Ralph and I talked about creating OMPI variants, but we didn't know what kind of additional MPI-specific intelligence would be useful. Got any suggestions?

--
Jeff Squyres
Cisco Systems
