Re: [OMPI devel] Fwd: OpenMPI changes

Greg Watson Tue, 4 Mar 2008 18:42:02 -0500

Ralph,

Looking at PTP, the only thing we need is to query the process information (PID, rank, node) when the job is created. Perhaps if only queries are allowed from callbacks then recursion would be eliminated?

If you can get this functionality into your new interface and back in the trunk, I'll take a look at porting PTP to use it.


Thanks,
Greg

On Mar 4, 2008, at 6:14 PM, Ralph Castain wrote:

Yeah, the problem we had in the past was:

1. Something would trigger in the system - e.g., a particular job state
was reached. This would cause us to execute a callback function via the
GPR.

2. The callback function would take some action. Typically, this
involved sending out a message or calling another function. Either way,
the eventual result of that action would be to cause another GPR trigger
to fire - either the job or a process changing state.

This loop would continue ad infinitum. Sometimes, I would see stack
traces hundreds of calls deep. Debugging and maintaining something that
intertwined was impossible.
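
[Editorial illustration, not part of the original message.] The trigger loop described above can be sketched in a few lines. This is a minimal Python model, not ORTE's C code and not the GPR API: `DeferredRegistry`, `set_state`, `subscribe`, and the `advance` callback are all hypothetical names. It shows one common way to keep a callback that changes state from nesting: a state change made from inside a callback is queued and delivered by the outermost call, so the callback depth stays at one however long the chain of triggers runs.

```python
from collections import deque


class DeferredRegistry:
    """Toy stand-in for a trigger registry (hypothetical, not the GPR API)."""

    def __init__(self):
        self.state = {}
        self.callbacks = []
        self.pending = deque()   # state changes waiting to be delivered
        self.draining = False    # True while the outermost call delivers events
        self.max_depth = 0       # deepest callback nesting observed
        self._depth = 0

    def subscribe(self, callback):
        self.callbacks.append(callback)

    def set_state(self, key, value):
        self.state[key] = value
        self.pending.append((key, value))
        if self.draining:
            # Called from inside a callback: queue the event and return
            # instead of firing the triggers recursively.
            return
        self.draining = True
        try:
            while self.pending:
                k, v = self.pending.popleft()
                for cb in list(self.callbacks):
                    self._depth += 1
                    self.max_depth = max(self.max_depth, self._depth)
                    try:
                        cb(self, k, v)
                    finally:
                        self._depth -= 1
        finally:
            self.draining = False


def advance(reg, key, value):
    # A callback whose action causes another trigger to fire.
    if value < 500:
        reg.set_state(key, value + 1)


reg = DeferredRegistry()
reg.subscribe(advance)
reg.set_state("job-state", 0)
print(reg.state["job-state"], reg.max_depth)
```

If `set_state` invoked the callbacks directly instead of queueing, the same 500-step chain would produce a call stack several hundred frames deep, which is the situation the message describes.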

People tried to impose order by establishing rules about what could and
could not be called from various situations, but that also proved
intractable. Problem was that we could get it to work for a "normal"
code path, but all the variety of failure modes, combined with all the
flexibility built into the code base, created so many code paths that
you inevitably wound up deadlocked under some corner case conditions.

Which we generally agreed was unacceptable.

It -is- possible to have callback functions that avoid this situation.
However, it is very easy to make a mistake and "hang" the whole system.
Just seemed easier to avoid the entire problem. (I don't get that
option!)

The ability to get an allocation without launching is easy to add.

I/O forwarding is currently an issue. Our IOF doesn't seem to like it
when I try to create an "alternate" tap (the default always goes back
through the persistent orted, so the tool looks like a second "tap" on
the flow). This is noted as a "bug" on our tracker, and I expect it will
be addressed prior to releasing 1.3. I will ask that it be raised in
priority.

I'll review what I had done and see about bringing it into the trunk by
the end of the week.

Ralph



On 3/4/08 4:00 PM, "Greg Watson" <g.wat...@computer.org> wrote:

I don't have a problem using a different interface, assuming it's
adequately supported and provides the functionality we need. I presume
the recursive behavior you're referring to is calling OMPI interfaces
from the callback functions. Any event-based system has this issue,
and it is usually solved by clearly specifying the allowable
interfaces that can be called (possibly none). Since PTP doesn't call
OMPI functions from callbacks, it's not a problem for us if no
interfaces can be called.
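
[Editorial illustration, not part of the original message.] A rule about which interfaces a callback may call can be enforced at run time rather than only documented. The sketch below is a toy Python model with hypothetical names throughout (`GuardedRuntime`, `query_proc`, `launch_proc`, `CallbackContextError`); it is not the OMPI tool interface. It assumes the policy floated elsewhere in this thread: read-only queries are allowed from a callback, and anything that changes state is rejected.

```python
class CallbackContextError(RuntimeError):
    """Raised when a state-changing call is made from inside a callback."""


class GuardedRuntime:
    """Toy runtime (hypothetical): callbacks may query but not mutate."""

    def __init__(self):
        self._procs = {}            # rank -> (pid, node)
        self._subscribers = []
        self._in_callback = False

    def subscribe(self, callback):
        self._require_outside_callback("subscribe")
        self._subscribers.append(callback)

    def query_proc(self, rank):
        # Read-only, so it is safe to call from a callback.
        return self._procs[rank]

    def launch_proc(self, rank, pid, node):
        self._require_outside_callback("launch_proc")
        self._procs[rank] = (pid, node)
        self._in_callback = True
        try:
            for cb in list(self._subscribers):
                cb(self, rank)
        finally:
            self._in_callback = False

    def _require_outside_callback(self, name):
        if self._in_callback:
            raise CallbackContextError(
                f"{name}() may not be called from a callback"
            )


seen = []


def record(rt, rank):
    # Allowed: the callback only queries process information.
    seen.append((rank, *rt.query_proc(rank)))


def misbehave(rt, rank):
    # Not allowed: the callback tries to change state.
    rt.launch_proc(rank + 1, 0, "nowhere")


rt = GuardedRuntime()
rt.subscribe(record)
rt.launch_proc(0, 4242, "node01")
```

A tool that only records PID, rank, and node, as `record` does, never trips the guard. A callback like `misbehave` fails immediately with a clear error instead of re-entering the runtime.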

The major missing features appear to be:

- Ability to request a process allocation without launching the job
- I/O forwarding callbacks

Without these, PTP support will be so limited that I'd be reluctant to
say we support OMPI.

Greg

On Mar 4, 2008, at 4:50 PM, Ralph H Castain wrote:

It is buried deep-down in the thread, but I'll just reiterate it here. I
have "restored" the ability to "subscribe" to changes in job, proc, and
node state via OMPI's tool interface library. I have -not- checked this
into the trunk yet, though, until the community has a chance to consider
whether or not it wants it.

Restoring the ability to have such changes "callback" to user functions
raises the concern again about recursive behavior. We worked hard to
remove recursion from the code base, and it would be a concern to see it
potentially re-enter.

I realize there is some difference between ORTE calling back into itself
vs calling back into a user-specified function. However, unless that
user truly understands ORTE/OMPI and takes considerable precautions, it
is very easy to recreate the recursive behavior without intending to do
so.

The tool interface library was built to accomplish two things:

1. help reduce the impact on external tools of changes to ORTE/OMPI
interfaces, and

2. provide a degree of separation to prevent the tool from inadvertently
causing OMPI to "behave badly"

I think we accomplished that - I would encourage you to at least
consider using the library. If there is something missing, we can always
add it.

Ralph



On 3/4/08 2:37 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

Greg --

I admit to being a bit puzzled here.  Ralph sent around RFCs about
these changes many months ago.  Everyone said they didn't want this
functionality -- it was seen as excess functionality that Open MPI
didn't want or need -- so it was all removed.

As such, I have to agree with Ralph that it is an "enhancement" to
re-add the functionality. That being said, patches are always welcome!
IBM has signed the OMPI 3rd party contribution agreement, so it could
be contributed directly.

Sidenote: I was also under the impression that PTP was being re-geared
towards STCI and moving away from ORTE anyway.  Is this incorrect?



On Mar 4, 2008, at 3:24 PM, Greg Watson wrote:

Hi all,

Ralph informs me that significant functionality has been removed from
ORTE in 1.3. Unfortunately this functionality was being used by PTP to
provide support for OMPI, and without it, it seems unlikely that PTP
will be able to work with 1.3. Apparently restoring this lost
functionality is an "enhancement" of 1.3, and so is something that
will not necessarily be done. Having worked with OMPI from a very
early stage to ensure that we were able to provide robust support, I
must say it is a bit disappointing that this approach is being taken.
I hope that the community will view this "enhancement" as worthwhile.

Regards,

Greg

Begin forwarded message:


On 2/29/08 7:13 AM, "Gregory R Watson" <g...@us.ibm.com> wrote:



Ralph Castain <r...@lanl.gov> wrote on 02/29/2008 12:18:39 AM:

Ralph Castain <r...@lanl.gov>
02/29/08 12:18 AM

To: Gregory R Watson/Watson/IBM@IBMUS
cc:
Subject: Re: OpenMPI changes

Hi Greg

All of the prior options (and some new ones) for spawning a job are
fully supported in the new interface. Instead of setting them with
"attributes", you create an orte_job_t object and just fill them in.
This is precisely how mpirun does it - you can look at that code if you
want an example, though it is somewhat complex. Alternatively, you can
look at the way it is done for comm_spawn, which may be more analogous
to your situation - that code is in ompi/mca/dpm/orte.

All the tools library does is communicate the job object to the target
persistent daemon so it can do the work. This way, you don't have to
open all the frameworks, deal directly with the plm interface, etc.
Alternatively, you are welcome to do a full orte_init and use the
frameworks yourself - there is no requirement to use the library. I only
offer it as an alternative.

As far as I can tell, neither API provides the same functionality as
that available in 1.2. While this might be beneficial for OMPI-specific
activities, the changes appear to severely limit the interaction of
tools with the runtime. At this point, I can't see either interface
supporting PTP.


I went ahead and added a notification capability to the system - took
about 30 minutes. I can provide notice of job and process state changes
since I see those. Node state changes, however, are different - I can
notify on them, but we have no way of seeing them. None of the
environments we support tell us when a node fails.

I know that the tool library works because it uses the identical APIs as
comm_spawn and mpirun. I have also tested them by building my own tools.


There's a big difference between being on a code path that *must* work
because it is used by core components, and one that is provided as an
add-on for external tools. I may be worrying needlessly if this new
interface becomes an "officially supported" API. Is that planned? At a
minimum, it seems like it's going to complicate your testing process,
since you're going to need to provide a separate set of tests that
exercise this interface independent of the rest of OMPI.

It is an officially supported API. Testing is not as big a problem as
you might expect since the library exercises the same code paths as
mpirun and comm_spawn. Like I said, I have written my own tools that
exercise the library - no problem using them as tests.

We do not launch an orted for any tool-library query. All we do is
communicate the query to the target persistent daemon or mpirun. Those
entities have recv's posted to catch any incoming messages and execute
the request.

You are correct that we no longer have event driven notification in the
system. I repeatedly asked the community (on both devel and core lists)
for input on that question, and received no indications that anyone
wanted it supported. It can be added back into the system, but would
require the approval of the OMPI community. I don't know how problematic
that would be - there is a lot of concern over the amount of memory,
overhead, and potential reliability issues that surround event
notification. If you want that capability, I suggest we discuss it, come
up with a plan that deals with those issues, and then take a proposal to
the devel list for discussion.

As for reliability, the objectives of the last year's effort were
precisely scalability and reliability. We did a lot of work to eliminate
recursive deadlocks and improve the reliability of the code. Our current
testing indicates we had considerable success in that regard,
particularly with the recursion elimination commit earlier today.

I would be happy to work with you to meet the PTP's needs - we'll just
need to work with the OMPI community to ensure everyone buys into the
plan. If it would help, I could come and review the new arch with the
team (I already gave a presentation on it to IBM Rochester MN) and
discuss required enhancements.

PTP's needs have not changed since 1.0. From our perspective, the 1.3
branch simply removes functionality that is required for PTP to support
OMPI. It seems strange that we need "approval of the OMPI community" to
continue to use functionality that has been available since 1.0. In any
case, there are unfortunately no resources to work on the kind of
re-engineering that appears to be required to support 1.3, even if it
did provide the functionality we need.


Afraid I have to be driven by the OMPI community's requirements since
they pay my salary :-)  What they need is a "lean, mean, OMPI machine"
as they say, and (for some reason) they view the debugger community as
consisting of folks like totalview, vampirtrace, etc. - all of whom get
involved (either directly or via one of the OMPI members) in the
requirements discussions.

Can't argue with business decisions, though. I gather there was some
mention of PTP at the recent LANL/IBM RR meeting, so I'll let people
know that PTP won't be an option on RR.

And I'll see if there is any interest here in adding 1.3 support to PTP
ourselves - from looking at your code, I think it would take about a
day, assuming someone more familiar with PTP will work with me.

Take care
Ralph


Greg


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


