Re: [OMPI devel] Urgent: ORTE_RML_NAME_SEED removed from 1.2b3!

Greg Watson Tue, 30 Jan 2007 11:24:24 -0500

Yes, we need the hostfile information before job execution. We call setup_job() before a debug job to request a process allocation for the application being debugged. We use spawn() to launch a non-debug application. It sounds like I should just leave things the way they currently are.

I think we've had the discussion about bproc before, but the reason we still support 1.0.2 is that the registry *is* populated with node information prior to launch. This was an agreed-on feature that OpenMPI was to provide for PTP. I haven't been able to test 1.2 on a bproc machine (since I can't get it to work), but it sounds like the changes removed this functionality. Frankly, this makes OpenMPI less attractive to us, since we now have to go and get this information ourselves. My thinking now is that in the future we probably won't use OpenMPI for anything other than building and launching the application.


Greg

On Jan 29, 2007, at 6:57 PM, Ralph Castain wrote:

On further thought, perhaps I should be clearer. If you are saying that you need to read the hostfile to display the cluster *before* the user actually submits a job for execution, then fine - go ahead and call rds.query.
What I'm trying to communicate to you is that you need to call setup_job when you are launching the resulting application. If you want, you could do the following:
1. call orte_rds.query(ORTE_JOBID_INVALID) to get your host info. Note that only a hostfile will be read here - so if you are in (for example) a bproc environment, you won't get any node info at this point.

2. when you are ready to launch the app, call orte_rmgr.spawn with an attribute list that contains ORTE_RMGR_SPAWN_FLOW with a value of ORTE_RMGR_SETUP | ORTE_RMGR_ALLOC | ORTE_RMGR_MAP | ORTE_RMGR_SETUP_TRIGS | ORTE_RMGR_LAUNCH. This will tell spawn to do everything *except* rds.query so you avoid re-entering the hostfile info.
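
As a rough illustration of step 2, the sketch below shows how such a flow value could be assembled as a bitmask that leaves out the query stage. It is not ORTE code: the DEMO_ names and their bit values are assumptions made for this example, and the real ORTE_RMGR_* constants and the attribute-list call should be taken from the Open MPI headers.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the spawn-flow stages named above.
 * The bit values are assumptions for this sketch, not the real
 * ORTE_RMGR_* definitions. */
enum {
    DEMO_FLOW_SETUP       = 1 << 0,  /* create jobid, store app_context */
    DEMO_FLOW_QUERY       = 1 << 1,  /* resource discovery (rds.query)  */
    DEMO_FLOW_ALLOC       = 1 << 2,  /* allocate resources              */
    DEMO_FLOW_MAP         = 1 << 3,  /* map processes onto nodes        */
    DEMO_FLOW_SETUP_TRIGS = 1 << 4,  /* set up triggers                 */
    DEMO_FLOW_LAUNCH      = 1 << 5   /* launch the application          */
};

/* Every stage except resource discovery, as in step 2. */
static uint8_t demo_flow_without_query(void)
{
    return (uint8_t)(DEMO_FLOW_SETUP | DEMO_FLOW_ALLOC | DEMO_FLOW_MAP |
                     DEMO_FLOW_SETUP_TRIGS | DEMO_FLOW_LAUNCH);
}

/* Returns 1 if the given stage is requested in the flow mask, else 0. */
static int demo_flow_has(uint8_t flow, uint8_t stage)
{
    return (flow & stage) != 0;
}
```

A caller would build the mask once with demo_flow_without_query() and could use demo_flow_has(flow, DEMO_FLOW_QUERY) to confirm that the hostfile will not be read a second time.
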
Unfortunately, if you want to see node info prior to launch on anything other than a hostfile, we really don't have a way to do that right now. The ORTE 2.0 design allows for it, but we haven't implemented that yet - probably a few months away.

Hope that helps
Ralph


On 1/29/07 6:45 PM, "Ralph Castain" <r...@lanl.gov> wrote:
On 1/29/07 5:57 PM, "Greg Watson" <gwat...@lanl.gov> wrote:
Ralph,

On Jan 29, 2007, at 11:10 AM, Ralph H Castain wrote:
On 1/29/07 10:20 AM, "Greg Watson" <gwat...@lanl.gov> wrote:
No, we have always called query() first, just after orte_init(). Since query() has never required a job id before, this used to work. I think the call was required to kick the SOH into action, but I'm not sure if it was needed for any other purpose.
Query has nothing to do with the SOH - the only time you would "need" it would be if you are reading a hostfile. Otherwise, it doesn't do anything at the moment.


Not calling setup_job would be risky, in my opinion...
We've had this discussion before. We *need* to read the hostfile
before calling setup_job() because we have to populate the registry
with node information. If you're saying that this is now no longer
possible, then I'd respectfully ask that this functionality be
restored before you release 1.2. If there is some other way to
achieve this, then please let me know. We've been doing this ever
since 1.0 and in the alpha and beta versions of 1.2.
I think you don't understand what setup_job does. Setup_job has four arguments:
(a) an array of app_context objects that contain the application to be launched
(b) the number of elements in that array
(c) a pointer to a location where the jobid for this job is to be returned; and
(d) a list of attributes that allows the caller to "fine-tune" behavior
With that info, setup_job will:
(a) create a new jobid for your application; and
(b) store the app_context info in appropriate places in the registry
And that is *all* setup_job does - it simply gets a jobid and initializes some important info in the registry. It never looks at node information, nor does it in any way impact node info.
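
To make that separation concrete, here is a toy model in C. Every type and function in it (demo_registry, demo_app_context, demo_setup_job) is hypothetical and only mirrors the description above; the explicit registry argument is an addition for this sketch, and the real orte_rmgr.setup_job signature and registry layout are not shown in this thread.

```c
#include <stddef.h>

#define DEMO_MAX_APPS 8

/* Hypothetical, greatly simplified application description. */
typedef struct {
    const char *command;
    int num_procs;
} demo_app_context;

/* Hypothetical registry: job data and node data live in separate fields. */
typedef struct {
    int next_jobid;                        /* next jobid to hand out          */
    int num_apps;                          /* entries stored for the last job */
    demo_app_context apps[DEMO_MAX_APPS];
    int num_nodes;                         /* filled in elsewhere, e.g. by a
                                              hostfile query                  */
} demo_registry;

/* Mirrors the four arguments described above: the app_context array,
 * its length, where to return the jobid, and an attribute list
 * (ignored in this sketch). Returns 0 on success, -1 on bad input. */
static int demo_setup_job(demo_registry *reg,
                          const demo_app_context *apps, size_t num_apps,
                          int *jobid, const void *attributes)
{
    (void)attributes;
    if (reg == NULL || apps == NULL || jobid == NULL ||
        num_apps == 0 || num_apps > DEMO_MAX_APPS) {
        return -1;
    }
    *jobid = reg->next_jobid++;            /* (a) create a new jobid          */
    for (size_t i = 0; i < num_apps; i++) {
        reg->apps[i] = apps[i];            /* (b) store the app_context info  */
    }
    reg->num_apps = (int)num_apps;
    /* reg->num_nodes is deliberately left untouched. */
    return 0;
}
```

In this model, whatever node count an earlier query stored in num_nodes is unchanged after demo_setup_job returns, which is the point being made about the two operations having nothing in common.
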
Calling rds.query after rmgr.setup_job is how we always do it. In truth, the precise ordering of those two operations is immaterial as they have absolutely nothing in common. However, we always do it in the described order so that rds.query can have a valid jobid. As I said, at the moment rds.query doesn't actually use the jobid, though that will change at some point in the future.
Although it isn't *absolutely* necessary, I would still suggest that you call rmgr.setup_job before calling rds.query to ensure that any subsequent operations have all the info they require to function correctly. You can see the progression we use in orte/mca/rmgr/urm/rmgr_urm.c - I believe you will find it helpful to follow that logic.
Alternatively, if you want, you can simply repeatedly call orte_rmgr.spawn and use the attributes I built for you to step your way through the standard launch. As you probably recall, I gave you the ability to specify - at a very atomistic level - exactly which steps in the spawn process were to be implemented at each call into rmgr.spawn. You can look at the referenced file to see the attribute for each step in the procedure.
Are there likely to be further API changes before the release version? We are trying to release PTP, but I think this is impossible until your APIs stabilize.
None planned, other than what I mentioned above. If you want to support Open MPI 1.2, you may need a slight phase shift, though, so you can see the final release.
Please explain "phase shift".

Greg
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
------ End of Forwarded Message

