On Jan 29, 2007, at 6:47 AM, Ralph H Castain wrote:




On 1/27/07 9:37 AM, "Greg Watson" <gwat...@lanl.gov> wrote:

There are two more interfaces that have changed:

1. orte_rds.query() now takes a job id, whereas in 1.2b1 it didn't
take any arguments. I seem to remember that I call this to kick orted
into action, but I'm not sure of the implications of not calling it.
In any case, I don't have a job id when I call it, so what do I pass
to get the old behavior?

For now, you can just use ORTE_JOBID_INVALID (defined in
orte/mca/ns/ns_types.h).

However, your question raises a flag. You should be calling
orte_rmgr.setup_job before you call the RDS, and that function returns the
jobid for your job. Failing to call setup_job first may cause other parts of
the code base to fail, as they expect certain data to be set up in the
registry by setup_job.

If you do call setup_job first, then just pass the returned jobid along to
rds.query.

No, we have always called query() first, just after orte_init(). Since query() has never required a job id before, this used to work. I think the call was required to kick the SOH into action, but I'm not sure if it was needed for any other purpose.



2. orte_pls.terminate_job() now takes a list of attributes in
addition to a job id. What are the attributes for, and what happens
if I pass a NULL here? Do I need to create an empty attribute list?


You can always pass a NULL to any function looking for attributes - the
system knows how to handle that situation.

What you should pass here depends upon what you are trying to do. If you
just want to terminate a specific job, then you can just pass a NULL.
However, if you want to terminate the specified job AND any "children" that
were dynamically spawned by that job, then you need to pass the
ORTE_NS_INCLUDE_DESCENDANTS attribute - something like the following code
snippet (pulled from orterun) would work:

#include "opal/class/opal_list.h"

#include "orte/mca/pls/pls.h"
#include "orte/mca/rmgr/rmgr.h"
#include "orte/mca/ns/ns_types.h"
#include "orte/runtime/params.h"

    opal_list_t attrs;
    opal_list_item_t *item;
    int ret;

    /* Build an attribute list asking that descendants be terminated too. */
    OBJ_CONSTRUCT(&attrs, opal_list_t);
    orte_rmgr.add_attribute(&attrs, ORTE_NS_INCLUDE_DESCENDANTS, ORTE_UNDEF,
                            NULL, ORTE_RMGR_ATTR_OVERRIDE);
    ret = orte_pls.terminate_job(jobid, &orte_abort_timeout, &attrs);

    /* Release the list items, then the list itself. */
    while (NULL != (item = opal_list_remove_first(&attrs))) {
        OBJ_RELEASE(item);
    }
    OBJ_DESTRUCT(&attrs);


Please note that the orte_pls.terminate_job API in 1.2 will undergo a change
in the next few days (it is already changed in the trunk). The change,
included in the code snippet above, adds a timeout capability so that the
function "gives up" if the job doesn't terminate within the specified time.
The parameter given above references the orte-wide default value (adjustable
via MCA param), but you can give it anything you like - a NULL for the
timeout param means don't time out, so we'll keep trying until you order us
to quit.


Is this going to be in "1.2b4", or some other version? The previous API changes mean that PTP will no longer work with pre-1.2b3 versions. It sounds like this is going to cause a similar issue.

Are there likely to be further API changes before the release version? We are trying to release PTP, but I think this is impossible until your APIs stabilize.

What about orte_ns.free_name()?

Thanks,

Greg

