I do have some questions about this.

1) If I understood correctly, we need orte_output and orte_show_help in order to distinguish the application's stdout/stderr from the MPI library's output? Who applies the filter: the local daemon or the HNP? How do we make sure that the remote outputs are not interleaved?

2) Who actually generates the error message? In your item #2, I wonder how you distinguish between what needs to be printed once (such as the PML initialization error) and what is supposed to be printed multiple times (such as a BTL TCP connection failure). If the HNP manages these error messages, this will force us to always install all error files; otherwise this approach cannot work in a heterogeneous environment (for example, when the local installation doesn't have InfiniBand support but the remote one includes it).

3) What is the OMPI layer supposed to use? opal_output? orte_output? Or maybe ompi_output?

  george.

On May 9, 2008, at 5:52 PM, Jeff Squyres wrote:

Per the teleconf this week, Ralph and I worked up two new features
that we're nearly ready to put back in the trunk:

1. IBM+LANL needed a way to XML-ize all output that comes out of OMPI
so that 3rd party tools can parse and use it intelligently (e.g., the
PTP debugger can now distinguish between OMPI error messages and
stderr from the MPI app).

2. In order to do #1, we created separate logical channels (vs. just
throwing everything in stderr and letting IOF relay it back to the
HNP) for the following:
   - stdout/stderr from the MPI app
   - opal_show_help() messages (***)
   - opal_output*() messages (***)
As a side effect, we now filter show_help() messages and only print
them *once* at the HNP (this has been a very long-standing goal of
mine).  So if your MPI app barfs, you will no longer see the same
show_help() error message N times -- you'll see it only once, possibly
accompanied with a "...and we got the same error message from N other
processes" notice.
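
The print-once behavior described above can be illustrated with a small sketch. This is not the ORTE implementation: the class name, method names, and the choice to key duplicates on a (help file, topic) pair are all assumptions made for illustration, and the file and topic names in the usage note are hypothetical.

```python
from collections import OrderedDict


class ShowHelpFilter:
    """Hypothetical HNP-side filter: emit each help message once, count repeats."""

    def __init__(self):
        # (help_file, topic) -> number of processes that reported the message.
        # Assumption: two reports are "the same message" if this pair matches.
        self._counts = OrderedDict()

    def report(self, help_file, topic, text):
        """Record one report. Return the text to print, or None for a duplicate."""
        key = (help_file, topic)
        if key in self._counts:
            self._counts[key] += 1
            return None
        self._counts[key] = 1
        return text

    def summary(self):
        """Return one notice for every message that arrived more than once."""
        notices = []
        for (help_file, topic), count in self._counts.items():
            if count > 1:
                notices.append(
                    f"{count - 1} more process(es) sent help message "
                    f"{help_file} / {topic}"
                )
        return notices
```

With this sketch, the first call to report("help-example.txt", "example-topic", "...") returns the text to print, later calls with the same pair return None, and summary() yields the "same message from N other processes" style notice. Messages that should legitimately appear several times would need a different key or a bypass, which this sketch does not attempt.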

(***) To make both #1 and #2 work, we had to raise the abstraction
level.  That is, there had to be job-level intelligence about the
different kinds of output.  So we have created orte_output() (and
friends) and orte_show_help().  The OPAL variants still exist, but
they *SHOULD NOT BE USED* by the MPI layer.  Specifically, the OPAL
variants are for what OPAL does best: single process stuff.  The ORTE
variants provide the job-level intelligence, such as duplicate
show_help filtering, relaying to the HNP in a different channel than
IOF, etc.

So when this stuff hits the trunk, you'll see a ton of
s/opal_output/orte_output/g and s/opal_show_help/orte_show_help/g
changes throughout the code base.  Do not be alarmed.  :-)

--
Jeff Squyres
Cisco Systems
