Hi Ralph,

Output attached in a file.
Thanks a lot!

Best,
Suraj

[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_tm_module.c:157\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:315\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE INIT_COMPLETE PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:326\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING ALLOCATION PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:421\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE ALLOCATION COMPLETE PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:182\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING DAEMON LAUNCH PRI 4\
[grsacc19:29071] mca: base: components_register: registering state components\
[grsacc19:29071] mca: base: components_register: found loaded component app\
[grsacc19:29071] mca: base: components_register: component app has no register or open function\
[grsacc19:29071] mca: base: components_register: found loaded component hnp\
[grsacc19:29071] mca: base: components_register: component hnp has no register or open function\
[grsacc19:29071] mca: base: components_register: found loaded component novm\
[grsacc19:29071] mca: base: components_register: component novm register function successful\
[grsacc19:29071] mca: base: components_register: found loaded component orted\
[grsacc19:29071] mca: base: components_register: component orted has no register or open function\
[grsacc19:29071] mca: base: components_register: found loaded component staged_hnp\
[grsacc19:29071] mca: base: components_register: component staged_hnp has no register or open function\
[grsacc19:29071] mca: base: components_register: found loaded component staged_orted\
[grsacc19:29071] mca: base: components_register: component staged_orted has no register or open function\
[grsacc19:29071] mca: base: components_open: opening state components\
[grsacc19:29071] mca: base: components_open: found loaded component app\
[grsacc19:29071] mca: base: components_open: component app open function successful\
[grsacc19:29071] mca: base: components_open: found loaded component hnp\
[grsacc19:29071] mca: base: components_open: component hnp open function successful\
[grsacc19:29071] mca: base: components_open: found loaded component novm\
[grsacc19:29071] mca: base: components_open: component novm open function successful\
[grsacc19:29071] mca: base: components_open: found loaded component orted\
[grsacc19:29071] mca: base: components_open: component orted open function successful\
[grsacc19:29071] mca: base: components_open: found loaded component staged_hnp\
[grsacc19:29071] mca: base: components_open: component staged_hnp open function successful\
[grsacc19:29071] mca: base: components_open: found loaded component staged_orted\
[grsacc19:29071] mca: base: components_open: component staged_orted open function successful\
[grsacc19:29071] mca:base:select: Auto-selecting state components\
[grsacc19:29071] mca:base:select:(state) Querying component [app]\
[grsacc19:29071] mca:base:select:(state) Skipping component [app]. Query failed to return a module\
[grsacc19:29071] mca:base:select:(state) Querying component [hnp]\
[grsacc19:29071] mca:base:select:(state) Skipping component [hnp]. Query failed to return a module\
[grsacc19:29071] mca:base:select:(state) Querying component [novm]\
[grsacc19:29071] mca:base:select:(state) Skipping component [novm]. Query failed to return a module\
[grsacc19:29071] mca:base:select:(state) Querying component [orted]\
[grsacc19:29071] mca:base:select:(state) Query of component [orted] set priority to 100\
[grsacc19:29071] mca:base:select:(state) Querying component [staged_hnp]\
[grsacc19:29071] mca:base:select:(state) Skipping component [staged_hnp]. Query failed to return a module\
[grsacc19:29071] mca:base:select:(state) Querying component [staged_orted]\
[grsacc19:29071] mca:base:select:(state) Skipping component [staged_orted]. Query failed to return a module\
[grsacc19:29071] mca:base:select:(state) Selected component [orted]\
[grsacc19:29071] mca: base: close: component app closed\
[grsacc19:29071] mca: base: close: unloading component app\
[grsacc19:29071] mca: base: close: component hnp closed\
[grsacc19:29071] mca: base: close: unloading component hnp\
[grsacc19:29071] mca: base: close: component novm closed\
[grsacc19:29071] mca: base: close: unloading component novm\
[grsacc19:29071] mca: base: close: component staged_hnp closed\
[grsacc19:29071] mca: base: close: unloading component staged_hnp\
[grsacc19:29071] mca: base: close: component staged_orted closed\
[grsacc19:29071] mca: base: close: unloading component staged_orted\
[grsacc19:29071] ORTE_JOB_STATE_MACHINE:\
[grsacc19:29071] 	State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED\
[grsacc19:29071] 	State: FORCED EXIT cbfunc: DEFINED\
[grsacc19:29071] 	State: DAEMONS TERMINATED cbfunc: DEFINED\
[grsacc19:29071] ORTE_PROC_STATE_MACHINE:\
[grsacc19:29071] 	State: RUNNING cbfunc: DEFINED\
[grsacc19:29071] 	State: SYNC REGISTERED cbfunc: DEFINED\
[grsacc19:29071] 	State: IOF COMPLETE cbfunc: DEFINED\
[grsacc19:29071] 	State: WAITPID FIRED cbfunc: DEFINED\
[grsacc19:29071] mca: base: components_register: registering errmgr components\
[grsacc19:29071] mca: base: components_register: found loaded component default_app\
[grsacc19:29071] mca: base: components_register: component default_app register function successful\
[grsacc19:29071] mca: base: components_register: found loaded component default_hnp\
[grsacc19:29071] mca: base: components_register: component default_hnp register function successful\
[grsacc19:29071] mca: base: components_register: found loaded component default_orted\
[grsacc19:29071] mca: base: components_register: component default_orted register function successful\
[grsacc19:29071] mca: base: components_open: opening errmgr components\
[grsacc19:29071] mca: base: components_open: found loaded component default_app\
[grsacc19:29071] mca: base: components_open: component default_app open function successful\
[grsacc19:29071] mca: base: components_open: found loaded component default_hnp\
[grsacc19:29071] mca: base: components_open: component default_hnp open function successful\
[grsacc19:29071] mca: base: components_open: found loaded component default_orted\
[grsacc19:29071] mca: base: components_open: component default_orted open function successful\
[grsacc19:29071] mca:base:select: Auto-selecting errmgr components\
[grsacc19:29071] mca:base:select:(errmgr) Querying component [default_app]\
[grsacc19:29071] mca:base:select:(errmgr) Skipping component [default_app]. Query failed to return a module\
[grsacc19:29071] mca:base:select:(errmgr) Querying component [default_hnp]\
[grsacc19:29071] mca:base:select:(errmgr) Skipping component [default_hnp]. Query failed to return a module\
[grsacc19:29071] mca:base:select:(errmgr) Querying component [default_orted]\
[grsacc19:29071] mca:base:select:(errmgr) Query of component [default_orted] set priority to 1000\
[grsacc19:29071] mca:base:select:(errmgr) Selected component [default_orted]\
[grsacc19:29071] mca: base: close: component default_app closed\
[grsacc19:29071] mca: base: close: unloading component default_app\
[grsacc19:29071] mca: base: close: component default_hnp closed\
[grsacc19:29071] mca: base: close: unloading component default_hnp\
[grsacc19:29071] [[6816,0],1] FORCE-TERMINATE AT oob_tcp_sendrecv.c:430\
[grsacc19:29066] [[6816,0],1] ACTIVATE PROC [[6816,0],0] STATE LIFELINE LOST AT oob_tcp_component.c:1102\
[grsacc19:29071] [[6816,0],1] ACTIVATE JOB NULL STATE FORCED EXIT AT oob_tcp_sendrecv.c:430\
[grsacc19:29071] [[6816,0],1] ACTIVATING JOB NULL STATE FORCED EXIT PRI 0\
[grsacc19:29071] mca: base: close: component default_orted closed\
[grsacc19:29066] [[6816,0],1] ACTIVATING PROC [[6816,0],0] STATE LIFELINE LOST PRI 0\
[grsacc19:29066] [[6816,0],1] errmgr:default_orted:proc_errors process [[6816,0],0] error state LIFELINE LOST\
[grsacc19:29066] [[6816,0],1] errmgr:orted lifeline lost - exiting\
[grsacc19:29071] mca: base: close: unloading component default_orted\
[grsacc19:29071] mca: base: close: component orted closed\
[grsacc19:29071] mca: base: close: unloading component orted\
[grsacc18:16957] mca: base: components_register: registering state components\
[grsacc18:16957] mca: base: components_register: found loaded component app\
[grsacc18:16957] mca: base: components_register: component app has no register or open function\
[grsacc18:16957] mca: base: components_register: found loaded component hnp\
[grsacc18:16957] mca: base: components_register: component hnp has no register or open function\
[grsacc18:16957] mca: base: components_register: found loaded component novm\
[grsacc18:16957] mca: base: components_register: component novm register function successful\
[grsacc18:16957] mca: base: components_register: found loaded component orted\
[grsacc18:16957] mca: base: components_register: component orted has no register or open function\
[grsacc18:16957] mca: base: components_register: found loaded component staged_hnp\
[grsacc18:16957] mca: base: components_register: component staged_hnp has no register or open function\
[grsacc18:16957] mca: base: components_register: found loaded component staged_orted\
[grsacc18:16957] mca: base: components_register: component staged_orted has no register or open function\
[grsacc18:16957] mca: base: components_open: opening state components\
[grsacc18:16957] mca: base: components_open: found loaded component app\
[grsacc18:16957] mca: base: components_open: component app open function successful\
[grsacc18:16957] mca: base: components_open: found loaded component hnp\
[grsacc18:16957] mca: base: components_open: component hnp open function successful\
[grsacc18:16957] mca: base: components_open: found loaded component novm\
[grsacc18:16957] mca: base: components_open: component novm open function successful\
[grsacc18:16957] mca: base: components_open: found loaded component orted\
[grsacc18:16957] mca: base: components_open: component orted open function successful\
[grsacc18:16957] mca: base: components_open: found loaded component staged_hnp\
[grsacc18:16957] mca: base: components_open: component staged_hnp open function successful\
[grsacc18:16957] mca: base: components_open: found loaded component staged_orted\
[grsacc18:16957] mca: base: components_open: component staged_orted open function successful\
[grsacc18:16957] mca:base:select: Auto-selecting state components\
[grsacc18:16957] mca:base:select:(state) Querying component [app]\
[grsacc18:16957] mca:base:select:(state) Skipping component [app]. Query failed to return a module\
[grsacc18:16957] mca:base:select:(state) Querying component [hnp]\
[grsacc18:16957] mca:base:select:(state) Skipping component [hnp]. Query failed to return a module\
[grsacc18:16957] mca:base:select:(state) Querying component [novm]\
[grsacc18:16957] mca:base:select:(state) Skipping component [novm]. Query failed to return a module\
[grsacc18:16957] mca:base:select:(state) Querying component [orted]\
[grsacc18:16957] mca:base:select:(state) Query of component [orted] set priority to 100\
[grsacc18:16957] mca:base:select:(state) Querying component [staged_hnp]\
[grsacc18:16957] mca:base:select:(state) Skipping component [staged_hnp]. Query failed to return a module\
[grsacc18:16957] mca:base:select:(state) Querying component [staged_orted]\
[grsacc18:16957] mca:base:select:(state) Skipping component [staged_orted]. Query failed to return a module\
[grsacc18:16957] mca:base:select:(state) Selected component [orted]\
[grsacc18:16957] mca: base: close: component app closed\
[grsacc18:16957] mca: base: close: unloading component app\
[grsacc18:16957] mca: base: close: component hnp closed\
[grsacc18:16957] mca: base: close: unloading component hnp\
[grsacc18:16957] mca: base: close: component novm closed\
[grsacc18:16957] mca: base: close: unloading component novm\
[grsacc18:16957] mca: base: close: component staged_hnp closed\
[grsacc18:16957] mca: base: close: unloading component staged_hnp\
[grsacc18:16957] mca: base: close: component staged_orted closed\
[grsacc18:16957] mca: base: close: unloading component staged_orted\
[grsacc18:16957] ORTE_JOB_STATE_MACHINE:\
[grsacc18:16957] 	State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED\
[grsacc18:16957] 	State: FORCED EXIT cbfunc: DEFINED\
[grsacc18:16957] 	State: DAEMONS TERMINATED cbfunc: DEFINED\
[grsacc18:16957] ORTE_PROC_STATE_MACHINE:\
[grsacc18:16957] 	State: RUNNING cbfunc: DEFINED\
[grsacc18:16957] 	State: SYNC REGISTERED cbfunc: DEFINED\
[grsacc18:16957] 	State: IOF COMPLETE cbfunc: DEFINED\
[grsacc18:16957] 	State: WAITPID FIRED cbfunc: DEFINED\
[grsacc18:16957] mca: base: components_register: registering errmgr components\
[grsacc18:16957] mca: base: components_register: found loaded component default_app\
[grsacc18:16957] mca: base: components_register: component default_app register function successful\
[grsacc18:16957] mca: base: components_register: found loaded component default_hnp\
[grsacc18:16957] mca: base: components_register: component default_hnp register function successful\
[grsacc18:16957] mca: base: components_register: found loaded component default_orted\
[grsacc18:16957] mca: base: components_register: component default_orted register function successful\
[grsacc18:16957] mca: base: components_open: opening errmgr components\
[grsacc18:16957] mca: base: components_open: found loaded component default_app\
[grsacc18:16957] mca: base: components_open: component default_app open function successful\
[grsacc18:16957] mca: base: components_open: found loaded component default_hnp\
[grsacc18:16957] mca: base: components_open: component default_hnp open function successful\
[grsacc18:16957] mca: base: components_open: found loaded component default_orted\
[grsacc18:16957] mca: base: components_open: component default_orted open function successful\
[grsacc18:16957] mca:base:select: Auto-selecting errmgr components\
[grsacc18:16957] mca:base:select:(errmgr) Querying component [default_app]\
[grsacc18:16957] mca:base:select:(errmgr) Skipping component [default_app]. Query failed to return a module\
[grsacc18:16957] mca:base:select:(errmgr) Querying component [default_hnp]\
[grsacc18:16957] mca:base:select:(errmgr) Skipping component [default_hnp]. Query failed to return a module\
[grsacc18:16957] mca:base:select:(errmgr) Querying component [default_orted]\
[grsacc18:16957] mca:base:select:(errmgr) Query of component [default_orted] set priority to 1000\
[grsacc18:16957] mca:base:select:(errmgr) Selected component [default_orted]\
[grsacc18:16957] mca: base: close: component default_app closed\
[grsacc18:16957] mca: base: close: unloading component default_app\
[grsacc18:16957] mca: base: close: component default_hnp closed\
[grsacc18:16957] mca: base: close: unloading component default_hnp\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE ALL DAEMONS REPORTED AT base/plm_base_launch_support.c:842\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE ALL DAEMONS REPORTED PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE VM READY AT base/plm_base_launch_support.c:170\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE VM READY PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING MAPPING AT base/plm_base_launch_support.c:207\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING MAPPING PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:316\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE MAP COMPLETE PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:233\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING FINAL SYSTEM PREP PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:410\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING APP LAUNCH PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1593\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE LOCAL LAUNCH COMPLETE PRI 4\
[grsacc18:16957] [[6816,0],2] ACTIVATE PROC [[6816,2],0] STATE RUNNING AT base/odls_base_default_fns.c:1545\
[grsacc18:16957] [[6816,0],2] ACTIVATING PROC [[6816,2],0] STATE RUNNING PRI 4\
[grsacc18:16957] [[6816,0],2] ACTIVATE JOB [6816,2] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1593\
[grsacc18:16957] [[6816,0],2] ACTIVATING JOB [6816,2] STATE LOCAL LAUNCH COMPLETE PRI 4\
[grsacc18:16957] [[6816,0],2] state:orted:track_procs called for proc [[6816,2],0] state RUNNING\
[grsacc18:16957] [[6816,0],2] state:orted:track_jobs sending local launch complete for job [6816,2]\
[grsacc20:04946] [[6816,0],0] ACTIVATE PROC [[6816,2],0] STATE RUNNING AT base/plm_base_receive.c:296\
[grsacc20:04946] [[6816,0],0] ACTIVATING PROC [[6816,2],0] STATE RUNNING PRI 4\
[grsacc18:16959] mca: base: components_register: registering state components\
[grsacc18:16959] mca: base: components_register: found loaded component app\
[grsacc18:16959] mca: base: components_register: component app has no register or open function\
[grsacc18:16959] mca: base: components_register: found loaded component hnp\
[grsacc18:16959] mca: base: components_register: component hnp has no register or open function\
[grsacc18:16959] mca: base: components_register: found loaded component novm\
[grsacc18:16959] mca: base: components_register: component novm register function successful\
[grsacc18:16959] mca: base: components_register: found loaded component orted\
[grsacc18:16959] mca: base: components_register: component orted has no register or open function\
[grsacc18:16959] mca: base: components_register: found loaded component staged_hnp\
[grsacc18:16959] mca: base: components_register: component staged_hnp has no register or open function\
[grsacc18:16959] mca: base: components_register: found loaded component staged_orted\
[grsacc18:16959] mca: base: components_register: component staged_orted has no register or open function\
[grsacc18:16959] mca: base: components_open: opening state components\
[grsacc18:16959] mca: base: components_open: found loaded component app\
[grsacc18:16959] mca: base: components_open: component app open function successful\
[grsacc18:16959] mca: base: components_open: found loaded component hnp\
[grsacc18:16959] mca: base: components_open: component hnp open function successful\
[grsacc18:16959] mca: base: components_open: found loaded component novm\
[grsacc18:16959] mca: base: components_open: component novm open function successful\
[grsacc18:16959] mca: base: components_open: found loaded component orted\
[grsacc18:16959] mca: base: components_open: component orted open function successful\
[grsacc18:16959] mca: base: components_open: found loaded component staged_hnp\
[grsacc18:16959] mca: base: components_open: component staged_hnp open function successful\
[grsacc18:16959] mca: base: components_open: found loaded component staged_orted\
[grsacc18:16959] mca: base: components_open: component staged_orted open function successful\
[grsacc18:16959] mca:base:select: Auto-selecting state components\
[grsacc18:16959] mca:base:select:(state) Querying component [app]\
[grsacc18:16959] mca:base:select:(state) Query of component [app] set priority to 1000\
[grsacc18:16959] mca:base:select:(state) Querying component [hnp]\
[grsacc18:16959] mca:base:select:(state) Skipping component [hnp]. Query failed to return a module\
[grsacc18:16959] mca:base:select:(state) Querying component [novm]\
[grsacc18:16959] mca:base:select:(state) Skipping component [novm]. Query failed to return a module\
[grsacc18:16959] mca:base:select:(state) Querying component [orted]\
[grsacc18:16959] mca:base:select:(state) Skipping component [orted]. Query failed to return a module\
[grsacc18:16959] mca:base:select:(state) Querying component [staged_hnp]\
[grsacc18:16959] mca:base:select:(state) Skipping component [staged_hnp]. Query failed to return a module\
[grsacc18:16959] mca:base:select:(state) Querying component [staged_orted]\
[grsacc18:16959] mca:base:select:(state) Skipping component [staged_orted]. Query failed to return a module\
[grsacc18:16959] mca:base:select:(state) Selected component [app]\
[grsacc18:16959] mca: base: close: component hnp closed\
[grsacc18:16959] mca: base: close: unloading component hnp\
[grsacc18:16959] mca: base: close: component novm closed\
[grsacc18:16959] mca: base: close: unloading component novm\
[grsacc18:16959] mca: base: close: component orted closed\
[grsacc18:16959] mca: base: close: unloading component orted\
[grsacc18:16959] mca: base: close: component staged_hnp closed\
[grsacc18:16959] mca: base: close: unloading component staged_hnp\
[grsacc18:16959] mca: base: close: component staged_orted closed\
[grsacc18:16959] mca: base: close: unloading component staged_orted\
[grsacc20:04946] [[6816,0],0] state:base:track_procs called for proc [[6816,2],0] state RUNNING\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE RUNNING AT base/state_base_fns.c:482\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE RUNNING PRI 4\
[grsacc18:16959] mca: base: components_register: registering errmgr components\
[grsacc18:16959] mca: base: components_register: found loaded component default_app\
[grsacc18:16959] mca: base: components_register: component default_app register function successful\
[grsacc18:16959] mca: base: components_register: found loaded component default_hnp\
[grsacc18:16959] mca: base: components_register: component default_hnp register function successful\
[grsacc18:16959] mca: base: components_register: found loaded component default_orted\
[grsacc18:16959] mca: base: components_register: component default_orted register function successful\
[grsacc18:16959] mca: base: components_open: opening errmgr components\
[grsacc18:16959] mca: base: components_open: found loaded component default_app\
[grsacc18:16959] mca: base: components_open: component default_app open function successful\
[grsacc18:16959] mca: base: components_open: found loaded component default_hnp\
[grsacc18:16959] mca: base: components_open: component default_hnp open function successful\
[grsacc18:16959] mca: base: components_open: found loaded component default_orted\
[grsacc18:16959] mca: base: components_open: component default_orted open function successful\
[grsacc18:16959] mca:base:select: Auto-selecting errmgr components\
[grsacc18:16959] mca:base:select:(errmgr) Querying component [default_app]\
[grsacc18:16959] mca:base:select:(errmgr) Query of component [default_app] set priority to 1000\
[grsacc18:16959] mca:base:select:(errmgr) Querying component [default_hnp]\
[grsacc18:16959] mca:base:select:(errmgr) Skipping component [default_hnp]. Query failed to return a module\
[grsacc18:16959] mca:base:select:(errmgr) Querying component [default_orted]\
[grsacc18:16959] mca:base:select:(errmgr) Skipping component [default_orted]. Query failed to return a module\
[grsacc18:16959] mca:base:select:(errmgr) Selected component [default_app]\
[grsacc18:16959] mca: base: close: component default_hnp closed\
[grsacc18:16959] mca: base: close: unloading component default_hnp\
[grsacc18:16959] mca: base: close: component default_orted closed\
[grsacc18:16959] mca: base: close: unloading component default_orted\
[grsacc18:16957] [[6816,0],2] ACTIVATE PROC [[6816,2],0] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1836\
[grsacc18:16957] [[6816,0],2] ACTIVATING PROC [[6816,2],0] STATE SYNC REGISTERED PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE PROC [[6816,2],0] STATE SYNC REGISTERED AT base/plm_base_receive.c:354\
[grsacc20:04946] [[6816,0],0] ACTIVATING PROC [[6816,2],0] STATE SYNC REGISTERED PRI 4\
[grsacc20:04946] [[6816,0],0] state:base:track_procs called for proc [[6816,2],0] state SYNC REGISTERED\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE SYNC REGISTERED AT base/state_base_fns.c:490\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE SYNC REGISTERED PRI 4\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:609\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE READY FOR DEBUGGERS PRI 4\
[grsacc18:16957] [[6816,0],2] state:orted:track_procs called for proc [[6816,2],0] state SYNC REGISTERED\
[grsacc18:16957] [[6816,0],2] state:orted: sending contact info to HNP\
[grsacc20:04946] [[6816,0],0] ACTIVATE PROC [[6816,0],1] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:1301\
[grsacc20:04946] [[6816,0],0] ACTIVATING PROC [[6816,0],1] STATE COMMUNICATION FAILURE PRI 0\
[grsacc20:04946] [[6816,0],0] errmgr:default_hnp: for proc [[6816,0],1] state COMMUNICATION FAILURE\
[grsacc20:04946] [[6816,0],0] Comm failure: daemon [[6816,0],1] - aborting\
[grsacc20:04946] [[6816,0],0] errmgr:default_hnp: abort called on job [6816,0]\
[grsacc20:04946] [[6816,0],0] errmgr:default_hnp: ordering orted termination\
[grsacc19:29066] [[6816,0],1] FORCE-TERMINATE AT errmgr_default_orted.c:259\
[grsacc19:29066] [[6816,0],1] ACTIVATE JOB NULL STATE FORCED EXIT AT errmgr_default_orted.c:259\
[grsacc19:29066] [[6816,0],1] ACTIVATING JOB NULL STATE FORCED EXIT PRI 0\
[grsacc19:29066] mca: base: close: component default_orted closed\
[grsacc19:29066] mca: base: close: unloading component default_orted\
[grsacc19:29066] mca: base: close: component orted closed\
[grsacc19:29066] mca: base: close: unloading component orted\
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:465\
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB NULL STATE DAEMONS TERMINATED PRI 0\
[grsacc20:04946] mca: base: close: component default_hnp closed\
[grsacc20:04946] mca: base: close: unloading component default_hnp\
[grsacc20:04946] mca: base: close: component hnp closed\
[grsacc20:04946] mca: base: close: unloading component hnp\
-bash-4.1$ [grsacc18:16957] [[6816,0],2] ACTIVATE PROC [[6816,0],0] STATE LIFELINE LOST AT oob_tcp_component.c:1102\
[grsacc18:16957] [[6816,0],2] ACTIVATING PROC [[6816,0],0] STATE LIFELINE LOST PRI 0\
[grsacc18:16957] [[6816,0],2] errmgr:default_orted:proc_errors process [[6816,0],0] error state LIFELINE LOST\
[grsacc18:16957] [[6816,0],2] errmgr:orted lifeline lost - exiting\
[grsacc18:16957] [[6816,0],2] FORCE-TERMINATE AT errmgr_default_orted.c:259\
[grsacc18:16957] [[6816,0],2] ACTIVATE JOB NULL STATE FORCED EXIT AT errmgr_default_orted.c:259\
[grsacc18:16957] [[6816,0],2] ACTIVATING JOB NULL STATE FORCED EXIT PRI 0\
[grsacc18:16957] mca: base: close: component default_orted closed\
[grsacc18:16957] mca: base: close: unloading component default_orted\
[grsacc18:16957] mca: base: close: component orted closed\
[grsacc18:16957] mca: base: close: unloading component orted

On Sep 24, 2013, at 4:11 PM, Ralph Castain wrote:

> Afraid I don't see the problem offhand - can you add the following to your 
> cmd line?
> 
> -mca state_base_verbose 10 -mca errmgr_base_verbose 10
> 
> Thanks
> Ralph
> 
> On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
> wrote:
> 
>> Hi Ralph, 
>> 
>> I have always gotten this output from any MPI job that ran on our nodes. 
>> There seems to be a problem somewhere, but it never stopped the applications 
>> from running. Anyway, I ran it again now with only TCP (InfiniBand excluded), 
>> and I get the same output, except that this time the openib-related error is 
>> gone. The log is printed again below. 
>> 
>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>> [grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from 
>> [[6160,1],0]
>> [grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
>> [grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>> [grsacc20:04578] [[6160,0],0] plm:base:setup_job
>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm
>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2]
>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon 
>> [[6160,0],2] to node grsacc18
>> [grsacc20:04578] [[6160,0],0] plm:tm: launching vm
>> [grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
>>      orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 
>> <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
>> tcp,sm,self
>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>      orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 
>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
>> tcp,sm,self
>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>      orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 
>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
>> tcp,sm,self
>> [grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
>> [grsacc19:28821] mca:base:select:(  plm) Querying component [rsh]
>> [grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>> [grsacc19:28821] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [grsacc19:28821] mca:base:select:(  plm) Selected component [rsh]
>> [grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>> [grsacc19:28821] [[6160,0],1] plm:base:receive start comm
>> [grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
>> [grsacc18:16717] mca:base:select:(  plm) Querying component [rsh]
>> [grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>> [grsacc18:16717] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [grsacc18:16717] mca:base:select:(  plm) Selected component [rsh]
>> [grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>> [grsacc18:16717] [[6160,0],2] plm:base:receive start comm
>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon 
>> [[6160,0],2]
>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon 
>> [[6160,0],2] on node grsacc18
>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for 
>> daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
>> [grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>> [grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command 
>> from [[6160,0],2]
>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job 
>> [6160,2]
>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for 
>> vpid 0 state RUNNING exit_code 0
>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>> [grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2]
>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>> [grsacc20:04578] [[6160,0],0] plm:base:launch registered event
>> [grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job 
>> [6160,2] to [[6160,1],0]
>> [grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands
>> [grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
>> [grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
>> -bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm
>> 
>> Best,
>> Suraj
>> On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote:
>> 
>>> Your output shows that it launched your apps, but they exited. The error is 
>>> reported here, though it appears we aren't flushing the message out before 
>>> exiting due to a race condition:
>>> 
>>>> [grsacc20:04511] 1 more process has sent help message 
>>>> help-mpi-btl-openib.txt / no active ports found
>>> 
>>> Here is the full text:
>>> [no active ports found]
>>> WARNING: There is at least non-excluded one OpenFabrics device found,
>>> but there are no active ports detected (or Open MPI was unable to use
>>> them).  This is most certainly not what you wanted.  Check your
>>> cables, subnet manager configuration, etc.  The openib BTL will be
>>> ignored for this job.
>>> 
>>> Local host: %s
>>> 
>>> Looks like at least one node being used doesn't have an active Infiniband 
>>> port on it?
>>> 
>>> 
>>> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran 
>>> <suraj.prabhaka...@gmail.com> wrote:
>>> 
>>>> Hi Ralph,
>>>> 
>>>> I tested it with the trunk at r29228 and I still have the following 
>>>> problem. Now it even spawns the daemon on the new node through Torque, but 
>>>> then it suddenly quits. The output is below. Can you please have a look? 
>>>> 
>>>> Thanks
>>>> Suraj
>>>> 
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from 
>>>> [[6253,1],0]
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job
>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm
>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2]
>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon 
>>>> [[6253,0],2] to node grsacc18
>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm
>>>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
>>>>    orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 
>>>> <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>    orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 
>>>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>    orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 
>>>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
>>>> [grsacc19:28754] mca:base:select:(  plm) Querying component [rsh]
>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>>> [grsacc19:28754] mca:base:select:(  plm) Query of component [rsh] set 
>>>> priority to 10
>>>> [grsacc19:28754] mca:base:select:(  plm) Selected component [rsh]
>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm
>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
>>>> [grsacc18:16648] mca:base:select:(  plm) Querying component [rsh]
>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>>> [grsacc18:16648] mca:base:select:(  plm) Query of component [rsh] set 
>>>> priority to 10
>>>> [grsacc18:16648] mca:base:select:(  plm) Selected component [rsh]
>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm
>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon 
>>>> [[6253,0],2]
>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon 
>>>> [[6253,0],2] on node grsacc18
>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for 
>>>> daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974
>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
>>>> [grsacc20:04511] 1 more process has sent help message 
>>>> help-mpi-btl-openib.txt / no active ports found
>>>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>>>> all help / error messages
>>>> [grsacc20:04511] 1 more process has sent help message 
>>>> help-mpi-btl-base.txt / btl:no-nics
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive update proc state command 
>>>> from [[6253,0],2]
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for 
>>>> job [6253,2]
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for 
>>>> vpid 0 state RUNNING exit_code 0
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job 
>>>> [6253,2]
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event
>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job 
>>>> [6253,2] to [[6253,1],0]
>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit 
>>>> commands
>>>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
>>>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:
>>>> 
>>>>> Found a bug in the Torque support - we were trying to connect to the MOM 
>>>>> again, which would hang (I imagine). I pushed a fix to the trunk (r29227) 
>>>>> and scheduled it to come to 1.7.3 if you want to try it again.
>>>>> 
>>>>> 
>>>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran 
>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>> 
>>>>>> Dear Ralph,
>>>>>> 
>>>>>> This is the output I get when I execute with the verbose option.
>>>>>> 
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from 
>>>>>> [[23526,1],0]
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon 
>>>>>> [[23526,0],2]
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
>>>>>> [[23526,0],2] to node grsacc17/1-4
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon 
>>>>>> [[23526,0],3]
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
>>>>>> [[23526,0],3] to node grsacc17/0-5
>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>>>>>>  orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid 
>>>>>> <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri 
>>>>>> "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only 
>>>>>> one event_base_loop can run on each event_base at once.
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit 
>>>>>> commands
>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
>>>>>> 
>>>>>> Says something?
>>>>>> 
>>>>>> Best,
>>>>>> Suraj
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
>>>>>> 
>>>>>>> I'll still need to look at the intercomm_create issue, but I just 
>>>>>>> tested both the trunk and current 1.7.3 branch for "add-host" and both 
>>>>>>> worked just fine. This was on my little test cluster which only has rsh 
>>>>>>> available - no Torque.
>>>>>>> 
>>>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some 
>>>>>>> debug output as to the problem.
>>>>>>> 
>>>>>>> 
>>>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran 
>>>>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Dear all,
>>>>>>>>> 
>>>>>>>>> Thanks a lot for your efforts. I also downloaded the trunk to check 
>>>>>>>>> whether it works for my case, and as of revision 29215 it works for 
>>>>>>>>> the original case I reported. Although it works, I still see the 
>>>>>>>>> following in the output. Does it mean anything?
>>>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>>>>>>> 
>>>>>>>> Yes - it means we don't quite have this right yet :-(
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> However, on another topic relevant to my use case, I have another 
>>>>>>>>> problem to report. I am having problems using the "add-host" info key 
>>>>>>>>> with MPI_Comm_spawn() when Open MPI is compiled with support for the 
>>>>>>>>> Torque resource manager. This problem is new in the 1.7 series; it 
>>>>>>>>> worked perfectly up to 1.6.5. 
>>>>>>>>> 
>>>>>>>>> Basically, I am working on implementing dynamic resource management 
>>>>>>>>> facilities in the Torque/Maui batch system. Through a new tm call, an 
>>>>>>>>> application can get new resources for a job.
>>>>>>>> 
>>>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to 
>>>>>>>> support precisely that operation. It allows an application to request 
>>>>>>>> that we dynamically obtain additional resources during execution 
>>>>>>>> (e.g., as part of a Comm_spawn call via an info_key). We originally 
>>>>>>>> implemented this with Slurm, but you could add the calls into the 
>>>>>>>> Torque component as well if you like.
>>>>>>>> 
>>>>>>>> This is in the trunk now - will come over to 1.7.4
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes on the new 
>>>>>>>>> hosts. With my extended Torque/Maui batch system, I was able to use 
>>>>>>>>> the "add-host" info argument to MPI_Comm_spawn() to spawn new 
>>>>>>>>> processes on these hosts without any trouble. Since MPI and Torque 
>>>>>>>>> refer to the hosts through node IDs, I made sure that Open MPI uses 
>>>>>>>>> the correct node IDs for these new hosts. 
>>>>>>>>> Up to 1.6.5 this worked perfectly, except that because of the 
>>>>>>>>> Intercomm_merge problem I could not run a real application to 
>>>>>>>>> completion.
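>>>>>>>>> 
>>>>>>>>> Roughly, the call looks like the following minimal sketch (the 
>>>>>>>>> hostname, executable name, and process count here are just 
>>>>>>>>> placeholders for what I actually obtain from the batch system at run 
>>>>>>>>> time):
>>>>>>>>> 
>>>>>>>>> #include <mpi.h>
>>>>>>>>> int main(int argc, char **argv)
>>>>>>>>> {
>>>>>>>>>     MPI_Comm children, merged;
>>>>>>>>>     MPI_Info info;
>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>     /* tell the runtime about the host(s) just granted by the batch system */
>>>>>>>>>     MPI_Info_create(&info);
>>>>>>>>>     MPI_Info_set(info, "add-host", "grsacc18");
>>>>>>>>>     MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
>>>>>>>>>                    MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
>>>>>>>>>     /* merging parent and children is the step that used to fail */
>>>>>>>>>     MPI_Intercomm_merge(children, 0, &merged);
>>>>>>>>>     /* ... work on the merged intra-communicator ... */
>>>>>>>>>     MPI_Comm_disconnect(&children);
>>>>>>>>>     MPI_Comm_free(&merged);
>>>>>>>>>     MPI_Info_free(&info);
>>>>>>>>>     MPI_Finalize();
>>>>>>>>>     return 0;
>>>>>>>>> }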
>>>>>>>>> 
>>>>>>>>> While this is now fixed in the trunk, I found that when using the 
>>>>>>>>> "add-host" info argument, everything collapses after printing the 
>>>>>>>>> following error. 
>>>>>>>>> 
>>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only 
>>>>>>>>> one event_base_loop can run on each event_base at once.
>>>>>>>> 
>>>>>>>> I'll take a look - probably some stale code that hasn't been updated 
>>>>>>>> yet for async ORTE operations
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Because of this, I am still not really able to run my application! I 
>>>>>>>>> also compiled Open MPI without any Torque/PBS support and just used 
>>>>>>>>> the "add-host" argument normally. Again, this worked perfectly in 
>>>>>>>>> 1.6.5, but in the 1.7 series it only works after printing the 
>>>>>>>>> following error.
>>>>>>>>> 
>>>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>> 
>>>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we 
>>>>>>>> "illegally" re-enter libevent. The error again means we don't have 
>>>>>>>> Intercomm_create correct just yet.
>>>>>>>> 
>>>>>>>> I'll see what I can do about this and get back to you
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> In short, with PBS/Torque support it fails, and without PBS/Torque 
>>>>>>>>> support it runs, but only after printing the above lines. 
>>>>>>>>> 
>>>>>>>>> I would really appreciate some help on this, since I need these 
>>>>>>>>> features to actually test my case, and (at least in my short 
>>>>>>>>> experience) no other MPI implementation seems friendly to such 
>>>>>>>>> dynamic scenarios. 
>>>>>>>>> 
>>>>>>>>> Thanks a lot!
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Suraj
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>>> 
>>>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all works 
>>>>>>>>>> for me.  Thanks!
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks George - much appreciated
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>>>> 
>>>>>>>>>>>> George.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hangs with any np > 1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However, I'm not sure if that's an issue with the test vs the 
>>>>>>>>>>>>> underlying implementation
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" 
>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Sent from my phone. No type good. 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one 
>>>>>>>>>>>>>>> difference - I only run it with np=1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) 
>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca 
>>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must 
>>>>>>>>>>>>>>>>> have another network enabled.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I know :-).  I have tcp available as well (OMPI will abort if 
>>>>>>>>>>>>>>>> you only run with sm,self because the comm_spawn will fail 
>>>>>>>>>>>>>>>> with unreachable errors -- I just tested/proved this to 
>>>>>>>>>>>>>>>> myself).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email, I left an 
>>>>>>>>>>>>>>>>> xterm based spawn and the debugging. It can't work without 
>>>>>>>>>>>>>>>>> xterm support. Instead try using the test case from the 
>>>>>>>>>>>>>>>>> trunk, the one committed by Ralph.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok.  :-)  I ran 
>>>>>>>>>>>>>>>> with orte/test/mpi/intercomm_create.c, and that hangs for me 
>>>>>>>>>>>>>>>> as well:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" 
>>>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your 
>>>>>>>>>>>>>>>>>> attached test case hangs:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>> 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>> 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>> 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>> 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create   
>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca 
>>>>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Here is a quick (and definitively not the cleanest) patch 
>>>>>>>>>>>>>>>>>>> that addresses the MPI_Intercomm issue at the MPI level. It 
>>>>>>>>>>>>>>>>>>> should be applied after removal of 29166.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I also added the corrected test case stressing the corner 
>>>>>>>>>>>>>>>>>>> cases by doing barriers at every inter-comm creation and 
>>>>>>>>>>>>>>>>>>> doing a clean disconnect.
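>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In outline (this is not the actual test file from the trunk, 
>>>>>>>>>>>>>>>>>>> and it uses a split of MPI_COMM_WORLD instead of the 
>>>>>>>>>>>>>>>>>>> spawn-based setup just to keep the sketch self-contained), 
>>>>>>>>>>>>>>>>>>> the pattern it stresses is:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>>>>>>>>> int main(int argc, char **argv)
>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>     MPI_Comm local, inter;
>>>>>>>>>>>>>>>>>>>     int rank;
>>>>>>>>>>>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>>>>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>>>>>>>>>>     /* build two groups and an inter-communicator between them */
>>>>>>>>>>>>>>>>>>>     MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &local);
>>>>>>>>>>>>>>>>>>>     MPI_Intercomm_create(local, 0, MPI_COMM_WORLD,
>>>>>>>>>>>>>>>>>>>                          (rank % 2) ? 0 : 1, 201, &inter);
>>>>>>>>>>>>>>>>>>>     MPI_Barrier(inter);           /* barrier at every inter-comm creation */
>>>>>>>>>>>>>>>>>>>     MPI_Comm_disconnect(&inter);  /* clean disconnect, not just a free */
>>>>>>>>>>>>>>>>>>>     MPI_Comm_free(&local);
>>>>>>>>>>>>>>>>>>>     MPI_Finalize();
>>>>>>>>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> (Run with at least two processes.)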
>>>>>>>>>>>>>>>>>> 