I do have some questions about this.

1) If I understood correctly, we need orte_output and orte_show_help in order to distinguish the application's stdout/stderr from the MPI library's output? Who applies the filter: the local daemon or the HNP? How do we make sure that the remote outputs are not interleaved?

2) Who is really generating the error message? In your item #2, I wonder how you distinguish between what needs to be printed once (such as the PML initialization error) and what is supposed to be printed multiple times (such as a BTL TCP connection failure)? If the HNP is managing these error messages, this will force us to always install all error files; otherwise this approach cannot work in a heterogeneous environment (for example, where the local installation doesn't have InfiniBand support but the remote one includes it).

3) What is the OMPI layer supposed to use? opal_output? orte_output? Or maybe ompi_output?

george.

On May 9, 2008, at 5:52 PM, Jeff Squyres wrote:

Per the teleconf this week, Ralph and I worked up two new features that we're nearly ready to put back in the trunk:

1. IBM+LANL needed a way to XML-ize all output that comes out of OMPI so that 3rd party tools can parse and use it intelligently (e.g., the PTP debugger can now distinguish between OMPI error messages and stderr from the MPI app).

2. In order to do #1, we created separate logical channels (vs. just throwing everything in stderr and letting IOF relay it back to the HNP) for the following:

- stdout/stderr from the MPI app
- opal_show_help() messages (***)
- opal_output*() messages (***)

As a side effect, we now filter show_help() messages and only print them *once* at the HNP (this has been a very long-standing goal of mine). So if your MPI app barfs, you will no longer see the same show_help() error message N times -- you'll see it only once, possibly accompanied with a "...and we got the same error message from N other processes" notice.

(***) To make both #1 and #2 work, we had to raise the abstraction level. That is, there had to be job-level intelligence about the different kinds of output. So we have created orte_output() (and friends) and orte_show_help(). The OPAL variants still exist, but they *SHOULD NOT BE USED* by the MPI layer. Specifically, the OPAL variants are for what OPAL does best: single process stuff. The ORTE variants provide the job-level intelligence, such as duplicate show_help filtering, relaying to the HNP in a different channel than IOF, etc.

So when this stuff hits the trunk, you'll see a ton of s/opal_output/orte_output/g and s/opal_show_help/orte_show_help/g changes throughout the code base. Do not be alarmed. :-)

--
Jeff Squyres
Cisco Systems
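
As a rough illustration of the duplicate show_help() filtering described above, here is a minimal C sketch of how a single collecting process could print a message once and count the repeats. This is not the ORTE implementation: the types and functions (help_filter_t, help_filter_report, help_filter_duplicates), the fixed-size table, and the idea of keying on a "file:topic" string are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_MSGS 64   /* hypothetical cap on distinct messages */
#define KEY_LEN  128  /* hypothetical cap on "file:topic" length */

/* One entry per distinct help message seen so far. */
typedef struct {
    char key[KEY_LEN]; /* "file:topic" identifying the message */
    int  count;        /* number of processes that reported it */
} seen_msg_t;

typedef struct {
    seen_msg_t entries[MAX_MSGS];
    int        n;
} help_filter_t;

/* Build the lookup key; overly long names are truncated by snprintf. */
static void make_key(char *key, const char *file, const char *topic)
{
    snprintf(key, KEY_LEN, "%s:%s", file, topic);
}

/*
 * Record one report of (file, topic).
 * Returns 1 if this is the first report (the caller should print it),
 * 0 if it is a duplicate (only the counter is bumped),
 * and -1 if the table is full.
 */
static int help_filter_report(help_filter_t *f, const char *file,
                              const char *topic)
{
    char key[KEY_LEN];

    assert(f != NULL);
    make_key(key, file, topic);
    for (int i = 0; i < f->n; i++) {
        if (strcmp(f->entries[i].key, key) == 0) {
            f->entries[i].count++;
            return 0;
        }
    }
    if (f->n == MAX_MSGS) {
        return -1;
    }
    strcpy(f->entries[f->n].key, key); /* safe: key is shorter than KEY_LEN */
    f->entries[f->n].count = 1;
    f->n++;
    return 1;
}

/* Number of *other* processes that sent the same message (0 if unseen). */
static int help_filter_duplicates(const help_filter_t *f, const char *file,
                                  const char *topic)
{
    char key[KEY_LEN];

    assert(f != NULL);
    make_key(key, file, topic);
    for (int i = 0; i < f->n; i++) {
        if (strcmp(f->entries[i].key, key) == 0) {
            return f->entries[i].count - 1;
        }
    }
    return 0;
}
```

A collector would call help_filter_report() for each incoming message, print the text only when it returns 1, and later use help_filter_duplicates() to emit the "...same error message from N other processes" notice. Messages that should appear every time (George's BTL TCP example) would simply bypass the filter in this sketch.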