[OMPI devel] IOF repair

Ralph Castain Wed, 9 Jul 2008 19:26:46 -0400

I have been investigating Ticket #1135 - stdin is read twice if rank=0
shares the node with mpirun. Repairing this problem is going to be quite
difficult due to the rather terrible spaghetti code in the IOF, and the fact
that the IOF in the HNP actually rml.sends the IO to itself multiple times
as it cycles through the spaghetti.

Unfortunately, this problem -is- a regression from 1.2. Rather than spending
weeks trying to fix it, I see two approaches we could pursue. First, I could
repair the problem by essentially returning the IOF to its 1.2 state. This
will have to be done by hand as most of the differences are in function
calls to utilities that have changed due to the removal of the old NS
framework. However, there are a few places where the logic itself has been
modified - and the problem must stem from somewhere in there.

If I make this change, then we will be no better, and no worse, than 1.2.
Note that we currently advise people to read from a file instead of from
stdin to avoid other issues that were present in 1.2.

Alternatively, we could ship 1.3 as-is, and warn users (similar to 1.2) that
they should avoid reading from stdin if there is any chance that rank=0
could be co-located with mpirun. Note that most of our clusters do not allow
such co-location - but it is permitted by default by OMPI.

We already plan to revisit the IOF at next week's technical meeting, with a
goal of reducing the IOF's API to a smaller set that reflects a less
ambitious requirement. I expect to implement those changes fairly soon
thereafter, but that would be targeted to 1.4 - not 1.3.

Any thoughts on which way we should go?
Ralph
