BTW, if you want to look at the code and changelog, it's here:
http://www.open-mpi.org/hg/hgwebdir.cgi/rhc/channel/
FWIW, the short answer to many of your questions is: orte_show_help()
and orte_output() are almost identical to their OPAL counterparts.
They have a few extensions to provide simple and desirable semantics
that make the output scheme more cohesive to a job (vs. a single OPAL
process). They also allow better integration with 3rd party tools by
separating user MPI process output from OMPI infrastructure output.
FWIW: the "eliminate duplicate help messages" functionality is a side
effect of these goals; I seized upon the opportunity to implement it
with the new infrastructure because I've wanted this feature for a
long, long, looooong time. :-)
Here's a brief comparison of the two:
- opal_output() and opal_show_help() write to their own process's
stdout and/or stderr streams (indistinguishable from other
stdout/stderr data). stdout/stderr from MPI processes is picked up by
the orted's IOF, sent via RML to the HNP, and then output as part of
the HNP's stdout/stderr streams.
- orte_output() and orte_show_help() render the final message string
and then send it to the HNP via RML. The HNP may apply some job-level
intelligence (e.g., adding XML tagging based on who the message came
from, eliminating duplicate show_help messages, etc.) before
outputting it somewhere, such as the HNP's stdout and/or stderr.
More details below.
On May 9, 2008, at 6:37 PM, George Bosilca wrote:
> 1) If I understood correctly, we need orte_output and
> orte_show_help in order to distinguish between the application's
> stdout/stderr and the MPI library's? Who is applying the filter?
> The local daemon or the HNP? How do we make sure that the remote
> outputs are not interleaved?
orte_show_help() and orte_output() messages are sent via RML to the
HNP (which is almost exactly what the IOF does). The HNP listens for
these RML messages and then displays them on the HNP stdout/stderr,
conditionally applying component-based filtering (e.g., sending the
output into a tagged XML stream, or whatever the filter component
does), and in the case of show_help messages, checking to see whether
the message has been displayed already or not.
orte_show_help() messages are always sent to the HNP stderr (just like
opal_show_help messages are always sent to stderr).
orte_output() messages are sent to the HNP's stdout and/or stderr,
depending on what was requested when the stream was opened (via
orte_output_open(), quite similar to opal_output_open()).
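To make the HNP side of that concrete, here is a rough sketch in plain C.
It is not the ORTE code: the names (hnp_deliver, filter_fn_t, dest_t) and
the <msg rank="..."> tag are made up for illustration, and a real XML
filter would also have to escape special characters in the text. The
only point is that the HNP gets an already-rendered string, optionally
runs it through a filter, and writes it to its own stdout or stderr.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical: which HNP stream a message was opened for. */
typedef enum { TO_STDOUT, TO_STDERR } dest_t;

/* Hypothetical filter signature: writes the decorated text into dst and
 * returns the length it needed (snprintf convention). */
typedef int (*filter_fn_t)(char *dst, size_t len, int rank, const char *text);

/* Pass-through filter: emit the text unchanged. */
int filter_plain(char *dst, size_t len, int rank, const char *text)
{
    (void)rank;
    return snprintf(dst, len, "%s", text);
}

/* Tagging filter: wrap the text in a made-up tag naming the sender.
 * Does not escape '<', '>' or '&' in text. */
int filter_tagged(char *dst, size_t len, int rank, const char *text)
{
    return snprintf(dst, len, "<msg rank=\"%d\">%s</msg>", rank, text);
}

/* Called on the HNP when a rendered message arrives from some process.
 * Returns the number of characters written, or -1 if the filtered
 * message does not fit in the local buffer. */
int hnp_deliver(filter_fn_t filter, dest_t dest, int rank, const char *text)
{
    char line[1024];
    int n;

    assert(NULL != filter && NULL != text);
    n = filter(line, sizeof(line), rank, text);
    if (n < 0 || (size_t)n >= sizeof(line)) {
        return -1;
    }
    fprintf(TO_STDERR == dest ? stderr : stdout, "%s\n", line);
    return n;
}
```

With this shape, switching between plain and tagged output is just a
matter of which filter function the HNP passes to hnp_deliver(); the
sending process does not need to know.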
<sidenote>
We had a long discussion about using the IOF for all of this stuff
(IOF has tags that could have been used). We ended up creating new
RML sends instead for the following reasons:
- The idea of the IOF is great, but the implementation is lacking in
many ways. We have several open tickets on the IOF, each of which
will require a *lot* of work to fix. We did not want to undertake
that right now; there are many complex issues involved in fixing the
IOF.
- The argument that swayed me was that for orte_output() and
orte_show_help(), the IOF adds little functionality over vanilla RML.
In both the RML and IOF, you'd end up adding a callback handler to be
called back when an output() or show_help() message arrives in the
HNP. So why include all the (complex and known buggy) IOF
infrastructure? Note: the functionality that we give up by using the
RML instead of the IOF is what is not implemented in the IOF anyway
(yet?) -- multiplexing output to multiple destinations. Perhaps
someday the IOF will be fixed and it will be worthwhile, but probably
not any time soon.
</sidenote>
> 2) Who is really generating the error message?
Excellent question: keep in mind the difference between *generating*
(or rendering) the error message and *outputting* the error message.
Whoever calls orte_output() or orte_show_help() renders the message
(usually MPI processes, but even non-MPI-process entities such as
orterun can invoke orte_show_help). The message is then sent to the
HNP to be output.
> In your item #2, I wonder how you distinguish between what needs to
> be printed once (such as the PML initialization error) and what is
> supposed to be printed multiple times (such as a BTL TCP connection
> failure)?
Only orte_show_help() messages are checked for duplicates.
orte_output() messages are always output -- just like opal_output().
Note that show_help() messages are already uniquely identified by
(filename, topic) tuples. So the MPI process renders the help message
into a single string, and then sends this string via RML, along with
the (filename, topic) tuple, to the HNP. The HNP examines the
(filename, topic) tuple to determine whether a similar message has
already been printed. It compares tuples rather than message text
because we have printf-like expansion in the rendering of show_help
messages, and we frequently include the hostname or other
process-specific data, so two messages with the same (filename, topic)
tuple are unlikely to be *exactly* the same. If a similar message has
already been printed, the count for that (filename, topic) tuple is
simply incremented, and an aggregate "Hey, I got N duplicates" notice
is printed every ~5 seconds (libevent is your friend) or when the HNP
terminates -- whichever occurs first.
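Here is a minimal sketch of that bookkeeping, assuming a small fixed-size
table; it is not the actual implementation. The names (help_entry_t,
help_msg_arrived, help_flush_duplicates) and the limits are hypothetical,
and the summary text is made up. In the scheme described above, something
like help_flush_duplicates() would be driven by a periodic timer and
called once more at HNP shutdown; the timer itself is left out here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TUPLES 64    /* arbitrary limits for this sketch */
#define TAG_LEN    128

/* One entry per (filename, topic) tuple the HNP has already printed. */
typedef struct {
    char filename[TAG_LEN];
    char topic[TAG_LEN];
    int  pending;        /* duplicates counted since the last summary */
} help_entry_t;

static help_entry_t seen[MAX_TUPLES];
static int num_seen = 0;

/* Returns 1 if the caller should print the message in full, or 0 if it
 * was only counted as a duplicate of an already-printed tuple. */
int help_msg_arrived(const char *filename, const char *topic)
{
    int i;

    assert(NULL != filename && NULL != topic);
    for (i = 0; i < num_seen; ++i) {
        if (0 == strcmp(seen[i].filename, filename) &&
            0 == strcmp(seen[i].topic, topic)) {
            seen[i].pending++;
            return 0;
        }
    }
    /* First time we see this tuple: remember it if there is room.
     * If the table is full, the message is printed but not tracked. */
    if (num_seen < MAX_TUPLES) {
        snprintf(seen[num_seen].filename, TAG_LEN, "%s", filename);
        snprintf(seen[num_seen].topic, TAG_LEN, "%s", topic);
        seen[num_seen].pending = 0;
        num_seen++;
    }
    return 1;
}

/* Prints one summary line per tuple with pending duplicates, resets the
 * counters, and returns the total number of duplicates reported. */
int help_flush_duplicates(FILE *out)
{
    int i, total = 0;

    for (i = 0; i < num_seen; ++i) {
        if (seen[i].pending > 0) {
            fprintf(out, "%d more message(s) for help file %s, topic %s\n",
                    seen[i].pending, seen[i].filename, seen[i].topic);
            total += seen[i].pending;
            seen[i].pending = 0;
        }
    }
    return total;
}
```

Note that only the two tags are compared; the rendered text never enters
into the decision, which is why per-process details like hostnames do
not defeat the duplicate check.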
> If the HNP is managing these error messages, this will force us to
> always install all error files; otherwise this approach cannot work
> in a heterogeneous environment (e.g., when the local installation
> doesn't have InfiniBand support but the remote one does).
The specific error message is rendered in the MPI process not only for
the reason you cite, but also because it's the only one that has all
the relevant data (potentially including the helpfile). We could
have bundled up all the printf-varargs and sent them to the HNP, but
it didn't seem worth it. It would probably be a bandwidth savings,
but a) who cares about optimizing error messages?, and b) that then
falls into the problem you noted (need all the helpfiles on the HNP).
Mebbe someday.
Regardless, the (filename, topic) tuple is still unique and doesn't
require the actual file to reside in the HNP-visible filesystem. From
the HNP's perspective, it's just a pair of text tags.
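As a sketch of what the sending side ends up with, assuming a simple
fixed-size struct: the process expands the printf-style template locally
and ships the finished text plus the two tags. The struct and function
names (help_msg_t, render_help_msg) and the buffer sizes are hypothetical,
and the format string is passed in directly as a stand-in for the
template that would normally be looked up in the helpfile by topic.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical payload: everything the HNP needs, no helpfile required. */
typedef struct {
    char filename[64];   /* tag only; the HNP never opens this file */
    char topic[64];      /* tag only */
    char text[512];      /* fully rendered message */
} help_msg_t;

/* Renders the message in the calling process. Returns the length the
 * rendered text needed (vsnprintf convention, so a value >= sizeof
 * msg->text means the text was truncated), or -1 if a tag is too long
 * to store without truncation. */
int render_help_msg(help_msg_t *msg, const char *filename,
                    const char *topic, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(NULL != msg && NULL != filename && NULL != topic && NULL != fmt);

    /* Refuse over-long tags: silently truncated tags could make two
     * different tuples look identical on the HNP. */
    if (strlen(filename) >= sizeof(msg->filename) ||
        strlen(topic) >= sizeof(msg->topic)) {
        return -1;
    }
    snprintf(msg->filename, sizeof(msg->filename), "%s", filename);
    snprintf(msg->topic, sizeof(msg->topic), "%s", topic);

    va_start(ap, fmt);
    n = vsnprintf(msg->text, sizeof(msg->text), fmt, ap);
    va_end(ap);
    return n;
}
```

For example, render_help_msg(&m, "help-example.txt", "connect-failed",
"Connection to %s failed", "node07") leaves the expanded sentence in
m.text, and the HNP only ever compares m.filename and m.topic as strings.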
> 3) What is the OMPI layer supposed to use? opal_output?
> orte_output? Or maybe ompi_output?
opal_output / opal_show_help are for serial processes, just like the
rest of the OPAL layer.
We created these new ORTE functions to have job-specific intelligence.
Ralph and I talked about creating OMPI variants, but we didn't know
what kind of additional MPI-specific intelligence would be useful.
Got any suggestions?
--
Jeff Squyres
Cisco Systems