Hi Ralph, Output attached in a file. Thanks a lot!
Best, Suraj
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_tm_module.c:157
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:315
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE INIT_COMPLETE PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:326
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING ALLOCATION PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:421
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE ALLOCATION COMPLETE PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:182
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING DAEMON LAUNCH PRI 4
[grsacc19:29071] mca: base: components_register: registering state components
[grsacc19:29071] mca: base: components_register: found loaded component app
[grsacc19:29071] mca: base: components_register: component app has no register or open function
[grsacc19:29071] mca: base: components_register: found loaded component hnp
[grsacc19:29071] mca: base: components_register: component hnp has no register or open function
[grsacc19:29071] mca: base: components_register: found loaded component novm
[grsacc19:29071] mca: base: components_register: component novm register function successful
[grsacc19:29071] mca: base: components_register: found loaded component orted
[grsacc19:29071] mca: base: components_register: component orted has no register or open function
[grsacc19:29071] mca: base: components_register: found loaded component staged_hnp
[grsacc19:29071] mca: base: components_register: component staged_hnp has no register or open function
[grsacc19:29071] mca: base: components_register: found loaded component staged_orted
[grsacc19:29071] mca: base: components_register: component staged_orted has no register or open function
[grsacc19:29071] mca: base: components_open: opening state components
[grsacc19:29071] mca: base: components_open: found loaded component app
[grsacc19:29071] mca: base: components_open: component app open function successful
[grsacc19:29071] mca: base: components_open: found loaded component hnp
[grsacc19:29071] mca: base: components_open: component hnp open function successful
[grsacc19:29071] mca: base: components_open: found loaded component novm
[grsacc19:29071] mca: base: components_open: component novm open function successful
[grsacc19:29071] mca: base: components_open: found loaded component orted
[grsacc19:29071] mca: base: components_open: component orted open function successful
[grsacc19:29071] mca: base: components_open: found loaded component staged_hnp
[grsacc19:29071] mca: base: components_open: component staged_hnp open function successful
[grsacc19:29071] mca: base: components_open: found loaded component staged_orted
[grsacc19:29071] mca: base: components_open: component staged_orted open function successful
[grsacc19:29071] mca:base:select: Auto-selecting state components
[grsacc19:29071] mca:base:select:(state) Querying component [app]
[grsacc19:29071] mca:base:select:(state) Skipping component [app]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Querying component [hnp]
[grsacc19:29071] mca:base:select:(state) Skipping component [hnp]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Querying component [novm]
[grsacc19:29071] mca:base:select:(state) Skipping component [novm]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Querying component [orted]
[grsacc19:29071] mca:base:select:(state) Query of component [orted] set priority to 100
[grsacc19:29071] mca:base:select:(state) Querying component [staged_hnp]
[grsacc19:29071] mca:base:select:(state) Skipping component [staged_hnp]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Querying component [staged_orted]
[grsacc19:29071] mca:base:select:(state) Skipping component [staged_orted]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Selected component [orted]
[grsacc19:29071] mca: base: close: component app closed
[grsacc19:29071] mca: base: close: unloading component app
[grsacc19:29071] mca: base: close: component hnp closed
[grsacc19:29071] mca: base: close: unloading component hnp
[grsacc19:29071] mca: base: close: component novm closed
[grsacc19:29071] mca: base: close: unloading component novm
[grsacc19:29071] mca: base: close: component staged_hnp closed
[grsacc19:29071] mca: base: close: unloading component staged_hnp
[grsacc19:29071] mca: base: close: component staged_orted closed
[grsacc19:29071] mca: base: close: unloading component staged_orted
[grsacc19:29071] ORTE_JOB_STATE_MACHINE:
[grsacc19:29071] State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[grsacc19:29071] State: FORCED EXIT cbfunc: DEFINED
[grsacc19:29071] State: DAEMONS TERMINATED cbfunc: DEFINED
[grsacc19:29071] ORTE_PROC_STATE_MACHINE:
[grsacc19:29071] State: RUNNING cbfunc: DEFINED
[grsacc19:29071] State: SYNC REGISTERED cbfunc: DEFINED
[grsacc19:29071] State: IOF COMPLETE cbfunc: DEFINED
[grsacc19:29071] State: WAITPID FIRED cbfunc: DEFINED
[grsacc19:29071] mca: base: components_register: registering errmgr components
[grsacc19:29071] mca: base: components_register: found loaded component default_app
[grsacc19:29071] mca: base: components_register: component default_app register function successful
[grsacc19:29071] mca: base: components_register: found loaded component default_hnp
[grsacc19:29071] mca: base: components_register: component default_hnp register function successful
[grsacc19:29071] mca: base: components_register: found loaded component default_orted
[grsacc19:29071] mca: base: components_register: component default_orted register function successful
[grsacc19:29071] mca: base: components_open: opening errmgr components
[grsacc19:29071] mca: base: components_open: found loaded component default_app
[grsacc19:29071] mca: base: components_open: component default_app open function successful
[grsacc19:29071] mca: base: components_open: found loaded component default_hnp
[grsacc19:29071] mca: base: components_open: component default_hnp open function successful
[grsacc19:29071] mca: base: components_open: found loaded component default_orted
[grsacc19:29071] mca: base: components_open: component default_orted open function successful
[grsacc19:29071] mca:base:select: Auto-selecting errmgr components
[grsacc19:29071] mca:base:select:(errmgr) Querying component [default_app]
[grsacc19:29071] mca:base:select:(errmgr) Skipping component [default_app]. Query failed to return a module
[grsacc19:29071] mca:base:select:(errmgr) Querying component [default_hnp]
[grsacc19:29071] mca:base:select:(errmgr) Skipping component [default_hnp]. Query failed to return a module
[grsacc19:29071] mca:base:select:(errmgr) Querying component [default_orted]
[grsacc19:29071] mca:base:select:(errmgr) Query of component [default_orted] set priority to 1000
[grsacc19:29071] mca:base:select:(errmgr) Selected component [default_orted]
[grsacc19:29071] mca: base: close: component default_app closed
[grsacc19:29071] mca: base: close: unloading component default_app
[grsacc19:29071] mca: base: close: component default_hnp closed
[grsacc19:29071] mca: base: close: unloading component default_hnp
[grsacc19:29071] [[6816,0],1] FORCE-TERMINATE AT oob_tcp_sendrecv.c:430
[grsacc19:29066] [[6816,0],1] ACTIVATE PROC [[6816,0],0] STATE LIFELINE LOST AT oob_tcp_component.c:1102
[grsacc19:29071] [[6816,0],1] ACTIVATE JOB NULL STATE FORCED EXIT AT oob_tcp_sendrecv.c:430
[grsacc19:29071] [[6816,0],1] ACTIVATING JOB NULL STATE FORCED EXIT PRI 0
[grsacc19:29071] mca: base: close: component default_orted closed
[grsacc19:29066] [[6816,0],1] ACTIVATING PROC [[6816,0],0] STATE LIFELINE LOST PRI 0
[grsacc19:29066] [[6816,0],1] errmgr:default_orted:proc_errors process [[6816,0],0] error state LIFELINE LOST
[grsacc19:29066] [[6816,0],1] errmgr:orted lifeline lost - exiting
[grsacc19:29071] mca: base: close: unloading component default_orted
[grsacc19:29071] mca: base: close: component orted closed
[grsacc19:29071] mca: base: close: unloading component orted
[grsacc18:16957] mca: base: components_register: registering state components
[grsacc18:16957] mca: base: components_register: found loaded component app
[grsacc18:16957] mca: base: components_register: component app has no register or open function
[grsacc18:16957] mca: base: components_register: found loaded component hnp
[grsacc18:16957] mca: base: components_register: component hnp has no register or open function
[grsacc18:16957] mca: base: components_register: found loaded component novm
[grsacc18:16957] mca: base: components_register: component novm register function successful
[grsacc18:16957] mca: base: components_register: found loaded component orted
[grsacc18:16957] mca: base: components_register: component orted has no register or open function
[grsacc18:16957] mca: base: components_register: found loaded component staged_hnp
[grsacc18:16957] mca: base: components_register: component staged_hnp has no register or open function
[grsacc18:16957] mca: base: components_register: found loaded component staged_orted
[grsacc18:16957] mca: base: components_register: component staged_orted has no register or open function
[grsacc18:16957] mca: base: components_open: opening state components
[grsacc18:16957] mca: base: components_open: found loaded component app
[grsacc18:16957] mca: base: components_open: component app open function successful
[grsacc18:16957] mca: base: components_open: found loaded component hnp
[grsacc18:16957] mca: base: components_open: component hnp open function successful
[grsacc18:16957] mca: base: components_open: found loaded component novm
[grsacc18:16957] mca: base: components_open: component novm open function successful
[grsacc18:16957] mca: base: components_open: found loaded component orted
[grsacc18:16957] mca: base: components_open: component orted open function successful
[grsacc18:16957] mca: base: components_open: found loaded component staged_hnp
[grsacc18:16957] mca: base: components_open: component staged_hnp open function successful
[grsacc18:16957] mca: base: components_open: found loaded component staged_orted
[grsacc18:16957] mca: base: components_open: component staged_orted open function successful
[grsacc18:16957] mca:base:select: Auto-selecting state components
[grsacc18:16957] mca:base:select:(state) Querying component [app]
[grsacc18:16957] mca:base:select:(state) Skipping component [app]. Query failed to return a module
[grsacc18:16957] mca:base:select:(state) Querying component [hnp]
[grsacc18:16957] mca:base:select:(state) Skipping component [hnp]. Query failed to return a module
[grsacc18:16957] mca:base:select:(state) Querying component [novm]
[grsacc18:16957] mca:base:select:(state) Skipping component [novm]. Query failed to return a module
[grsacc18:16957] mca:base:select:(state) Querying component [orted]
[grsacc18:16957] mca:base:select:(state) Query of component [orted] set priority to 100
[grsacc18:16957] mca:base:select:(state) Querying component [staged_hnp]
[grsacc18:16957] mca:base:select:(state) Skipping component [staged_hnp]. Query failed to return a module
[grsacc18:16957] mca:base:select:(state) Querying component [staged_orted]
[grsacc18:16957] mca:base:select:(state) Skipping component [staged_orted]. Query failed to return a module
[grsacc18:16957] mca:base:select:(state) Selected component [orted]
[grsacc18:16957] mca: base: close: component app closed
[grsacc18:16957] mca: base: close: unloading component app
[grsacc18:16957] mca: base: close: component hnp closed
[grsacc18:16957] mca: base: close: unloading component hnp
[grsacc18:16957] mca: base: close: component novm closed
[grsacc18:16957] mca: base: close: unloading component novm
[grsacc18:16957] mca: base: close: component staged_hnp closed
[grsacc18:16957] mca: base: close: unloading component staged_hnp
[grsacc18:16957] mca: base: close: component staged_orted closed
[grsacc18:16957] mca: base: close: unloading component staged_orted
[grsacc18:16957] ORTE_JOB_STATE_MACHINE:
[grsacc18:16957] State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[grsacc18:16957] State: FORCED EXIT cbfunc: DEFINED
[grsacc18:16957] State: DAEMONS TERMINATED cbfunc: DEFINED
[grsacc18:16957] ORTE_PROC_STATE_MACHINE:
[grsacc18:16957] State: RUNNING cbfunc: DEFINED
[grsacc18:16957] State: SYNC REGISTERED cbfunc: DEFINED
[grsacc18:16957] State: IOF COMPLETE cbfunc: DEFINED
[grsacc18:16957] State: WAITPID FIRED cbfunc: DEFINED
[grsacc18:16957] mca: base: components_register: registering errmgr components
[grsacc18:16957] mca: base: components_register: found loaded component default_app
[grsacc18:16957] mca: base: components_register: component default_app register function successful
[grsacc18:16957] mca: base: components_register: found loaded component default_hnp
[grsacc18:16957] mca: base: components_register: component default_hnp register function successful
[grsacc18:16957] mca: base: components_register: found loaded component default_orted
[grsacc18:16957] mca: base: components_register: component default_orted register function successful
[grsacc18:16957] mca: base: components_open: opening errmgr components
[grsacc18:16957] mca: base: components_open: found loaded component default_app
[grsacc18:16957] mca: base: components_open: component default_app open function successful
[grsacc18:16957] mca: base: components_open: found loaded component default_hnp
[grsacc18:16957] mca: base: components_open: component default_hnp open function successful
[grsacc18:16957] mca: base: components_open: found loaded component default_orted
[grsacc18:16957] mca: base: components_open: component default_orted open function successful
[grsacc18:16957] mca:base:select: Auto-selecting errmgr components
[grsacc18:16957] mca:base:select:(errmgr) Querying component [default_app]
[grsacc18:16957] mca:base:select:(errmgr) Skipping component [default_app]. Query failed to return a module
[grsacc18:16957] mca:base:select:(errmgr) Querying component [default_hnp]
[grsacc18:16957] mca:base:select:(errmgr) Skipping component [default_hnp]. Query failed to return a module
[grsacc18:16957] mca:base:select:(errmgr) Querying component [default_orted]
[grsacc18:16957] mca:base:select:(errmgr) Query of component [default_orted] set priority to 1000
[grsacc18:16957] mca:base:select:(errmgr) Selected component [default_orted]
[grsacc18:16957] mca: base: close: component default_app closed
[grsacc18:16957] mca: base: close: unloading component default_app
[grsacc18:16957] mca: base: close: component default_hnp closed
[grsacc18:16957] mca: base: close: unloading component default_hnp
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE ALL DAEMONS REPORTED AT base/plm_base_launch_support.c:842
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE ALL DAEMONS REPORTED PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE VM READY AT base/plm_base_launch_support.c:170
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE VM READY PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING MAPPING AT base/plm_base_launch_support.c:207
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING MAPPING PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:316
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE MAP COMPLETE PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:233
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING FINAL SYSTEM PREP PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:410
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING APP LAUNCH PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1593
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE LOCAL LAUNCH COMPLETE PRI 4
[grsacc18:16957] [[6816,0],2] ACTIVATE PROC [[6816,2],0] STATE RUNNING AT base/odls_base_default_fns.c:1545
[grsacc18:16957] [[6816,0],2] ACTIVATING PROC [[6816,2],0] STATE RUNNING PRI 4
[grsacc18:16957] [[6816,0],2] ACTIVATE JOB [6816,2] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1593
[grsacc18:16957] [[6816,0],2] ACTIVATING JOB [6816,2] STATE LOCAL LAUNCH COMPLETE PRI 4
[grsacc18:16957] [[6816,0],2] state:orted:track_procs called for proc [[6816,2],0] state RUNNING
[grsacc18:16957] [[6816,0],2] state:orted:track_jobs sending local launch complete for job [6816,2]
[grsacc20:04946] [[6816,0],0] ACTIVATE PROC [[6816,2],0] STATE RUNNING AT base/plm_base_receive.c:296
[grsacc20:04946] [[6816,0],0] ACTIVATING PROC [[6816,2],0] STATE RUNNING PRI 4
[grsacc18:16959] mca: base: components_register: registering state components
[grsacc18:16959] mca: base: components_register: found loaded component app
[grsacc18:16959] mca: base: components_register: component app has no register or open function
[grsacc18:16959] mca: base: components_register: found loaded component hnp
[grsacc18:16959] mca: base: components_register: component hnp has no register or open function
[grsacc18:16959] mca: base: components_register: found loaded component novm
[grsacc18:16959] mca: base: components_register: component novm register function successful
[grsacc18:16959] mca: base: components_register: found loaded component orted
[grsacc18:16959] mca: base: components_register: component orted has no register or open function
[grsacc18:16959] mca: base: components_register: found loaded component staged_hnp
[grsacc18:16959] mca: base: components_register: component staged_hnp has no register or open function
[grsacc18:16959] mca: base: components_register: found loaded component staged_orted
[grsacc18:16959] mca: base: components_register: component staged_orted has no register or open function
[grsacc18:16959] mca: base: components_open: opening state components
[grsacc18:16959] mca: base: components_open: found loaded component app
[grsacc18:16959] mca: base: components_open: component app open function successful
[grsacc18:16959] mca: base: components_open: found loaded component hnp
[grsacc18:16959] mca: base: components_open: component hnp open function successful
[grsacc18:16959] mca: base: components_open: found loaded component novm
[grsacc18:16959] mca: base: components_open: component novm open function successful
[grsacc18:16959] mca: base: components_open: found loaded component orted
[grsacc18:16959] mca: base: components_open: component orted open function successful
[grsacc18:16959] mca: base: components_open: found loaded component staged_hnp
[grsacc18:16959] mca: base: components_open: component staged_hnp open function successful
[grsacc18:16959] mca: base: components_open: found loaded component staged_orted
[grsacc18:16959] mca: base: components_open: component staged_orted open function successful
[grsacc18:16959] mca:base:select: Auto-selecting state components
[grsacc18:16959] mca:base:select:(state) Querying component [app]
[grsacc18:16959] mca:base:select:(state) Query of component [app] set priority to 1000
[grsacc18:16959] mca:base:select:(state) Querying component [hnp]
[grsacc18:16959] mca:base:select:(state) Skipping component [hnp]. Query failed to return a module
[grsacc18:16959] mca:base:select:(state) Querying component [novm]
[grsacc18:16959] mca:base:select:(state) Skipping component [novm]. Query failed to return a module
[grsacc18:16959] mca:base:select:(state) Querying component [orted]
[grsacc18:16959] mca:base:select:(state) Skipping component [orted]. Query failed to return a module
[grsacc18:16959] mca:base:select:(state) Querying component [staged_hnp]
[grsacc18:16959] mca:base:select:(state) Skipping component [staged_hnp]. Query failed to return a module
[grsacc18:16959] mca:base:select:(state) Querying component [staged_orted]
[grsacc18:16959] mca:base:select:(state) Skipping component [staged_orted]. Query failed to return a module
[grsacc18:16959] mca:base:select:(state) Selected component [app]
[grsacc18:16959] mca: base: close: component hnp closed
[grsacc18:16959] mca: base: close: unloading component hnp
[grsacc18:16959] mca: base: close: component novm closed
[grsacc18:16959] mca: base: close: unloading component novm
[grsacc18:16959] mca: base: close: component orted closed
[grsacc18:16959] mca: base: close: unloading component orted
[grsacc18:16959] mca: base: close: component staged_hnp closed
[grsacc18:16959] mca: base: close: unloading component staged_hnp
[grsacc18:16959] mca: base: close: component staged_orted closed
[grsacc18:16959] mca: base: close: unloading component staged_orted
[grsacc20:04946] [[6816,0],0] state:base:track_procs called for proc [[6816,2],0] state RUNNING
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE RUNNING AT base/state_base_fns.c:482
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE RUNNING PRI 4
[grsacc18:16959] mca: base: components_register: registering errmgr components
[grsacc18:16959] mca: base: components_register: found loaded component default_app
[grsacc18:16959] mca: base: components_register: component default_app register function successful
[grsacc18:16959] mca: base: components_register: found loaded component default_hnp
[grsacc18:16959] mca: base: components_register: component default_hnp register function successful
[grsacc18:16959] mca: base: components_register: found loaded component default_orted
[grsacc18:16959] mca: base: components_register: component default_orted register function successful
[grsacc18:16959] mca: base: components_open: opening errmgr components
[grsacc18:16959] mca: base: components_open: found loaded component default_app
[grsacc18:16959] mca: base: components_open: component default_app open function successful
[grsacc18:16959] mca: base: components_open: found loaded component default_hnp
[grsacc18:16959] mca: base: components_open: component default_hnp open function successful
[grsacc18:16959] mca: base: components_open: found loaded component default_orted
[grsacc18:16959] mca: base: components_open: component default_orted open function successful
[grsacc18:16959] mca:base:select: Auto-selecting errmgr components
[grsacc18:16959] mca:base:select:(errmgr) Querying component [default_app]
[grsacc18:16959] mca:base:select:(errmgr) Query of component [default_app] set priority to 1000
[grsacc18:16959] mca:base:select:(errmgr) Querying component [default_hnp]
[grsacc18:16959] mca:base:select:(errmgr) Skipping component [default_hnp]. Query failed to return a module
[grsacc18:16959] mca:base:select:(errmgr) Querying component [default_orted]
[grsacc18:16959] mca:base:select:(errmgr) Skipping component [default_orted]. Query failed to return a module
[grsacc18:16959] mca:base:select:(errmgr) Selected component [default_app]
[grsacc18:16959] mca: base: close: component default_hnp closed
[grsacc18:16959] mca: base: close: unloading component default_hnp
[grsacc18:16959] mca: base: close: component default_orted closed
[grsacc18:16959] mca: base: close: unloading component default_orted
[grsacc18:16957] [[6816,0],2] ACTIVATE PROC [[6816,2],0] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1836
[grsacc18:16957] [[6816,0],2] ACTIVATING PROC [[6816,2],0] STATE SYNC REGISTERED PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE PROC [[6816,2],0] STATE SYNC REGISTERED AT base/plm_base_receive.c:354
[grsacc20:04946] [[6816,0],0] ACTIVATING PROC [[6816,2],0] STATE SYNC REGISTERED PRI 4
[grsacc20:04946] [[6816,0],0] state:base:track_procs called for proc [[6816,2],0] state SYNC REGISTERED
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE SYNC REGISTERED AT base/state_base_fns.c:490
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE SYNC REGISTERED PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:609
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE READY FOR DEBUGGERS PRI 4
[grsacc18:16957] [[6816,0],2] state:orted:track_procs called for proc [[6816,2],0] state SYNC REGISTERED
[grsacc18:16957] [[6816,0],2] state:orted: sending contact info to HNP
[grsacc20:04946] [[6816,0],0] ACTIVATE PROC [[6816,0],1] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:1301
[grsacc20:04946] [[6816,0],0] ACTIVATING PROC [[6816,0],1] STATE COMMUNICATION FAILURE PRI 0
[grsacc20:04946] [[6816,0],0] errmgr:default_hnp: for proc [[6816,0],1] state COMMUNICATION FAILURE
[grsacc20:04946] [[6816,0],0] Comm failure: daemon [[6816,0],1] - aborting
[grsacc20:04946] [[6816,0],0] errmgr:default_hnp: abort called on job [6816,0]
[grsacc20:04946] [[6816,0],0] errmgr:default_hnp: ordering orted termination
[grsacc19:29066] [[6816,0],1] FORCE-TERMINATE AT errmgr_default_orted.c:259
[grsacc19:29066] [[6816,0],1] ACTIVATE JOB NULL STATE FORCED EXIT AT errmgr_default_orted.c:259
[grsacc19:29066] [[6816,0],1] ACTIVATING JOB NULL STATE FORCED EXIT PRI 0
[grsacc19:29066] mca: base: close: component default_orted closed
[grsacc19:29066] mca: base: close: unloading component default_orted
[grsacc19:29066] mca: base: close: component orted closed
[grsacc19:29066] mca: base: close: unloading component orted
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:465
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB NULL STATE DAEMONS TERMINATED PRI 0
[grsacc20:04946] mca: base: close: component default_hnp closed
[grsacc20:04946] mca: base: close: unloading component default_hnp
[grsacc20:04946] mca: base: close: component hnp closed
[grsacc20:04946] mca: base: close: unloading component hnp
-bash-4.1$ [grsacc18:16957] [[6816,0],2] ACTIVATE PROC [[6816,0],0] STATE LIFELINE LOST AT oob_tcp_component.c:1102
[grsacc18:16957] [[6816,0],2] ACTIVATING PROC [[6816,0],0] STATE LIFELINE LOST PRI 0
[grsacc18:16957] [[6816,0],2] errmgr:default_orted:proc_errors process [[6816,0],0] error state LIFELINE LOST
[grsacc18:16957] [[6816,0],2] errmgr:orted lifeline lost - exiting
[grsacc18:16957] [[6816,0],2] FORCE-TERMINATE AT errmgr_default_orted.c:259
[grsacc18:16957] [[6816,0],2] ACTIVATE JOB NULL STATE FORCED EXIT AT errmgr_default_orted.c:259
[grsacc18:16957] [[6816,0],2] ACTIVATING JOB NULL STATE FORCED EXIT PRI 0
[grsacc18:16957] mca: base: close: component default_orted closed
[grsacc18:16957] mca: base: close: unloading component default_orted
[grsacc18:16957] mca: base: close: component orted closed
[grsacc18:16957] mca: base: close: unloading component orted
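For readers skimming the log above: the failure happens while an MPI_Comm_spawn is placing a child onto a node that was added to the job at run time through the "add-host" info key, which is the use case described further down in this thread. A minimal sketch of that call pattern follows; the host name, slot count, and child executable are placeholders, not values taken from this thread.

#include <mpi.h>

/* Sketch only: spawn one child onto a host added to the allocation at run time.
   "grsaccXX" and "./child" are placeholders. */
int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (MPI_COMM_NULL == parent) {   /* parent side */
        MPI_Info_create(&info);
        /* ask the runtime to extend the VM with a host outside the original allocation */
        MPI_Info_set(info, "add-host", "grsaccXX");
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
    }
    MPI_Finalize();
    return 0;
}

The spawned child would see the parent group through MPI_Comm_get_parent(); in the log above, job [6816,2] is the comm_spawn'd job and the daemon launch and OOB traffic belong to setting it up.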
On Sep 24, 2013, at 4:11 PM, Ralph Castain wrote: > Afraid I don't see the problem offhand - can you add the following to your > cmd line? > > -mca state_base_verbose 10 -mca errmgr_base_verbose 10 > > Thanks > Ralph > > On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> > wrote: > >> Hi Ralph, >> >> I always got this output from any MPI job that ran on our nodes. There seems >> to be a problem somewhere but it never stopped the applications from >> running. But anyway, I ran it again now with only tcp and excluded the >> infiniband and I get the same output again. Except that this time, the error >> related to this openib is not there anymore. Printing out the log again. >> >> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg >> [grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from >> [[6160,1],0] >> [grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts >> [grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn >> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands >> [grsacc20:04578] [[6160,0],0] plm:base:setup_job >> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm >> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2] >> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon >> [[6160,0],2] to node grsacc18 >> [grsacc20:04578] [[6160,0],0] plm:tm: launching vm >> [grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv: >> orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid >> <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri >> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl >> tcp,sm,self >> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19 >> [grsacc20:04578] [[6160,0],0] plm:tm: executing: >> orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 >> -mca orte_ess_num_procs 3 -mca orte_hnp_uri >> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl >> tcp,sm,self >> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18 >> [grsacc20:04578] [[6160,0],0] plm:tm: executing: >> orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 >> -mca orte_ess_num_procs 3 -mca orte_hnp_uri >> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl >> tcp,sm,self >> [grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds >> [grsacc19:28821] mca:base:select:( plm) Querying component [rsh] >> [grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL >> [grsacc19:28821] mca:base:select:( plm) Query of component [rsh] set >> priority to 10 >> [grsacc19:28821] mca:base:select:( plm) Selected component [rsh] >> [grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL >> [grsacc19:28821] [[6160,0],1] plm:base:receive start comm >> [grsacc19:28821] [[6160,0],1] plm:base:receive stop comm >> [grsacc18:16717] mca:base:select:( plm) Querying component [rsh] >> [grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL >> [grsacc18:16717] mca:base:select:( plm) Query of component [rsh] set >> priority to 10 >> [grsacc18:16717] mca:base:select:( plm) Selected component [rsh] >> [grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL >> [grsacc18:16717] [[6160,0],2] plm:base:receive start comm >> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon >> [[6160,0],2] >> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon >> [[6160,0],2] on node 
grsacc18 >> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for >> daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229 >> [grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2] >> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg >> [grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command >> from [[6160,0],2] >> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job >> [6160,2] >> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for >> vpid 0 state RUNNING exit_code 0 >> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands >> [grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2] >> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg >> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands >> [grsacc20:04578] [[6160,0],0] plm:base:launch registered event >> [grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job >> [6160,2] to [[6160,1],0] >> [grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands >> [grsacc19:28815] [[6160,0],1] plm:base:receive stop comm >> [grsacc20:04578] [[6160,0],0] plm:base:receive stop comm >> -bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm >> >> Best, >> Suraj >> On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote: >> >>> Your output shows that it launched your apps, but they exited. The error is >>> reported here, though it appears we aren't flushing the message out before >>> exiting due to a race condition: >>> >>>> [grsacc20:04511] 1 more process has sent help message >>>> help-mpi-btl-openib.txt / no active ports found >>> >>> Here is the full text: >>> [no active ports found] >>> WARNING: There is at least non-excluded one OpenFabrics device found, >>> but there are no active ports detected (or Open MPI was unable to use >>> them). This is most certainly not what you wanted. Check your >>> cables, subnet manager configuration, etc. The openib BTL will be >>> ignored for this job. >>> >>> Local host: %s >>> >>> Looks like at least one node being used doesn't have an active Infiniband >>> port on it? >>> >>> >>> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran >>> <suraj.prabhaka...@gmail.com> wrote: >>> >>>> Hi Ralph, >>>> >>>> I tested it with the trunk r29228. I still have the following problem. >>>> Now, it even spawns the daemon on the new node through torque but then >>>> suddently quits. The following is the output. Can you please have a look? 
>>>> >>>> Thanks >>>> Suraj >>>> >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from >>>> [[6253,1],0] >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands >>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job >>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm >>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2] >>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon >>>> [[6253,0],2] to node grsacc18 >>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm >>>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv: >>>> orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid >>>> <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri >>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6 >>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19 >>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing: >>>> orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 >>>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri >>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6 >>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18 >>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing: >>>> orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 >>>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri >>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6 >>>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds >>>> [grsacc19:28754] mca:base:select:( plm) Querying component [rsh] >>>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL >>>> [grsacc19:28754] mca:base:select:( plm) Query of component [rsh] set >>>> priority to 10 >>>> [grsacc19:28754] mca:base:select:( plm) Selected component [rsh] >>>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL >>>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm >>>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm >>>> [grsacc18:16648] mca:base:select:( plm) Querying component [rsh] >>>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL >>>> [grsacc18:16648] mca:base:select:( plm) Query of component [rsh] set >>>> priority to 10 >>>> [grsacc18:16648] mca:base:select:( plm) Selected component [rsh] >>>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL >>>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm >>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon >>>> [[6253,0],2] >>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon >>>> [[6253,0],2] on node grsacc18 >>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for >>>> daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974 >>>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2] >>>> [grsacc20:04511] 1 more process has sent help message >>>> help-mpi-btl-openib.txt / no active ports found >>>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see >>>> all help / error messages >>>> [grsacc20:04511] 1 more process has sent help message >>>> help-mpi-btl-base.txt / btl:no-nics >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg >>>> [grsacc20:04511] 
[[6253,0],0] plm:base:receive update proc state command >>>> from [[6253,0],2] >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for >>>> job [6253,2] >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for >>>> vpid 0 state RUNNING exit_code 0 >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands >>>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job >>>> [6253,2] >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands >>>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event >>>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job >>>> [6253,2] to [[6253,1],0] >>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit >>>> commands >>>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm >>>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm >>>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm >>>> >>>> >>>> >>>> >>>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote: >>>> >>>>> Found a bug in the Torque support - we were trying to connect to the MOM >>>>> again, which would hang (I imagine). I pushed a fix to the trunk (r29227) >>>>> and scheduled it to come to 1.7.3 if you want to try it again. >>>>> >>>>> >>>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran >>>>> <suraj.prabhaka...@gmail.com> wrote: >>>>> >>>>>> Dear Ralph, >>>>>> >>>>>> This is the output I get when I execute with the verbose option. >>>>>> >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from >>>>>> [[23526,1],0] >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon >>>>>> [[23526,0],2] >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon >>>>>> [[23526,0],2] to node grsacc17/1-4 >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon >>>>>> [[23526,0],3] >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon >>>>>> [[23526,0],3] to node grsacc17/0-5 >>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm >>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv: >>>>>> orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid >>>>>> <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri >>>>>> "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5 >>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only >>>>>> one event_base_loop can run on each event_base at once. >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit >>>>>> commands >>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm >>>>>> >>>>>> Says something? >>>>>> >>>>>> Best, >>>>>> Suraj >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote: >>>>>> >>>>>>> I'll still need to look at the intercomm_create issue, but I just >>>>>>> tested both the trunk and current 1.7.3 branch for "add-host" and both >>>>>>> worked just fine. 
This was on my little test cluster which only has rsh >>>>>>> available - no Torque. >>>>>>> >>>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some >>>>>>> debug output as to the problem. >>>>>>> >>>>>>> >>>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote: >>>>>>> >>>>>>>> >>>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran >>>>>>>> <suraj.prabhaka...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Dear all, >>>>>>>>> >>>>>>>>> Really thanks a lot for your efforts. I too downloaded the trunk to >>>>>>>>> check if it works for my case and as of revision 29215, it works for >>>>>>>>> the original case I reported. Although it works, I still see the >>>>>>>>> following in the output. Does it mean anything? >>>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] >>>>>>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0] >>>>>>>> >>>>>>>> Yes - it means we don't quite have this right yet :-( >>>>>>>> >>>>>>>>> >>>>>>>>> However, on another topic relevant to my use case, I have another >>>>>>>>> problem to report. I am having problems using the "add-host" info to >>>>>>>>> the MPI_Comm_spawn() when MPI is compiled with support for Torque >>>>>>>>> resource manager. This problem is totally new in the 1.7 series and >>>>>>>>> it worked perfectly until 1.6.5 >>>>>>>>> >>>>>>>>> Basically, I am working on implementing dynamic resource management >>>>>>>>> facilities in the Torque/Maui batch system. Through a new tm call, an >>>>>>>>> application can get new resources for a job. >>>>>>>> >>>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to >>>>>>>> support precisely that operation. It allows an application to request >>>>>>>> that we dynamically obtain additional resources during execution >>>>>>>> (e.g., as part of a Comm_spawn call via an info_key). We originally >>>>>>>> implemented this with Slurm, but you could add the calls into the >>>>>>>> Torque component as well if you like. >>>>>>>> >>>>>>>> This is in the trunk now - will come over to 1.7.4 >>>>>>>> >>>>>>>> >>>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes in the new >>>>>>>>> hosts. With my extended torque/maui batch system, I was able to >>>>>>>>> perfectly use the "add-host" info argument to MPI_Comm_spawn() to >>>>>>>>> spawn new processes on these hosts. Since MPI and Torque refer to the >>>>>>>>> hosts through the nodeids, I made sure that OpenMPI uses the correct >>>>>>>>> nodeid's for these new hosts. >>>>>>>>> Until 1.6.5, this worked perfectly fine, except that due to the >>>>>>>>> Intercomm_merge problem, I could not really run a real application to >>>>>>>>> its completion. >>>>>>>>> >>>>>>>>> While this is now fixed in the trunk, I found that, however, when >>>>>>>>> using the "add-host" info argument, everything collapses after >>>>>>>>> printing out the following error. >>>>>>>>> >>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only >>>>>>>>> one event_base_loop can run on each event_base at once. >>>>>>>> >>>>>>>> I'll take a look - probably some stale code that hasn't been updated >>>>>>>> yet for async ORTE operations >>>>>>>> >>>>>>>>> >>>>>>>>> And due to this, I am still not really able to run my application! I >>>>>>>>> also compiled the MPI without any Torque/PBS support and just used >>>>>>>>> the "add-host" argument normally. Again, this worked perfectly in >>>>>>>>> 1.6.5. But in the 1.7 series, it works but after printing out the >>>>>>>>> following error. 
>>>>>>>>> >>>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] >>>>>>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0] >>>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] >>>>>>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0] >>>>>>>> >>>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we >>>>>>>> "illegally" re-enter libevent. The error again means we don't have >>>>>>>> Intercomm_create correct just yet. >>>>>>>> >>>>>>>> I'll see what I can do about this and get back to you >>>>>>>> >>>>>>>>> >>>>>>>>> In short, with pbs/torque support, it fails and without pbs/torque >>>>>>>>> support, it runs after spitting the above lines. >>>>>>>>> >>>>>>>>> I would really appreciate some help on this, since I need these >>>>>>>>> features to actually test my case and (at least in my short >>>>>>>>> experience) no other MPI implementation seem friendly to such dynamic >>>>>>>>> scenarios. >>>>>>>>> >>>>>>>>> Thanks a lot! >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Suraj >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote: >>>>>>>>> >>>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all works >>>>>>>>>> for me. Thanks! >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks George - much appreciated >>>>>>>>>>> >>>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> The test case was broken. I just pushed a fix. >>>>>>>>>>>> >>>>>>>>>>>> George. >>>>>>>>>>>> >>>>>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hangs with any np > 1 >>>>>>>>>>>>> >>>>>>>>>>>>> However, I'm not sure if that's an issue with the test vs the >>>>>>>>>>>>> underlying implementation >>>>>>>>>>>>> >>>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" >>>>>>>>>>>>> <jsquy...@cisco.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Does it hang when you run with -np 4? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sent from my phone. No type good. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one >>>>>>>>>>>>>>> difference - I only run it with np=1 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) >>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca >>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must >>>>>>>>>>>>>>>>> have another network enabled. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if >>>>>>>>>>>>>>>> you only run with sm,self because the comm_spawn will fail >>>>>>>>>>>>>>>> with unreachable errors -- I just tested/proved this to >>>>>>>>>>>>>>>> myself). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email, I left an >>>>>>>>>>>>>>>>> xterm based spawn and the debugging. It can't work without >>>>>>>>>>>>>>>>> xterm support. Instead try using the test case from the >>>>>>>>>>>>>>>>> trunk, the one committed by Ralph. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok. 
:-) I ran >>>>>>>>>>>>>>>> with orte/test/mpi/intercomm_create.c, and that hangs for me >>>>>>>>>>>>>>>> as well: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create >>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create >>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, >>>>>>>>>>>>>>>> &inter) [rank 4] >>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, >>>>>>>>>>>>>>>> &inter) [rank 5] >>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, >>>>>>>>>>>>>>>> &inter) [rank 6] >>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, >>>>>>>>>>>>>>>> &inter) [rank 7] >>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, >>>>>>>>>>>>>>>> &inter) [rank 4] >>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, >>>>>>>>>>>>>>>> &inter) [rank 5] >>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, >>>>>>>>>>>>>>>> &inter) [rank 6] >>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, >>>>>>>>>>>>>>>> &inter) [rank 7] >>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, >>>>>>>>>>>>>>>> &inter) (0) >>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, >>>>>>>>>>>>>>>> &inter) (0) >>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, >>>>>>>>>>>>>>>> &inter) (0) >>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, >>>>>>>>>>>>>>>> &inter) (0) >>>>>>>>>>>>>>>> [hang] >>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create >>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create >>>>>>>>>>>>>>>> [hang] >>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> George. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" >>>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> George -- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your >>>>>>>>>>>>>>>>>> attached test case hangs: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create >>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create >>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, >>>>>>>>>>>>>>>>>> 201, &inter) [rank 4] >>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, >>>>>>>>>>>>>>>>>> 201, &inter) [rank 5] >>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, >>>>>>>>>>>>>>>>>> 201, &inter) [rank 6] >>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, >>>>>>>>>>>>>>>>>> 201, &inter) [rank 7] >>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, >>>>>>>>>>>>>>>>>> &inter) (0) >>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, >>>>>>>>>>>>>>>>>> &inter) (0) >>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, >>>>>>>>>>>>>>>>>> &inter) (0) >>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, >>>>>>>>>>>>>>>>>> &inter) (0) >>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, >>>>>>>>>>>>>>>>>> &inter) [rank 4] >>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, >>>>>>>>>>>>>>>>>> &inter) [rank 5] >>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, >>>>>>>>>>>>>>>>>> &inter) [rank 6] >>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, >>>>>>>>>>>>>>>>>> &inter) [rank 7] >>>>>>>>>>>>>>>>>> [hang] >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create >>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create >>>>>>>>>>>>>>>>>> [hang] >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca >>>>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Here is a quick (and definitively not the cleanest) patch >>>>>>>>>>>>>>>>>>> that addresses the MPI_Intercomm issue at the MPI level. It >>>>>>>>>>>>>>>>>>> should be applied after removal of 29166. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I also added the corrected test case stressing the corner >>>>>>>>>>>>>>>>>>> cases by doing barriers at every inter-comm creation and >>>>>>>>>>>>>>>>>>> doing a clean disconnect. 
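For context on the Intercomm_create/Intercomm_merge problem discussed in the quoted messages, the pattern a dynamically spawning application relies on is roughly the following. This is a generic sketch, not the orte/test/mpi/intercomm_create.c test case itself, and the executable name is a placeholder.

#include <mpi.h>

/* Generic sketch: the parent spawns two children, then both sides merge the
   resulting inter-communicator into a single intra-communicator. */
int main(int argc, char **argv)
{
    MPI_Comm parent, inter, merged;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (MPI_COMM_NULL == parent) {
        /* parent side: spawn the children ("./child" is a placeholder) */
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0, &merged);   /* parents ordered low */
    } else {
        MPI_Intercomm_merge(parent, 1, &merged);  /* children ordered high */
    }
    MPI_Barrier(merged);      /* collective across parents and children */
    MPI_Comm_free(&merged);
    MPI_Finalize();
    return 0;
}

It is around exactly this kind of merge/create sequence between the parent and spawned groups that the hangs reported earlier in the thread appear, which is why the test cases quoted above stress MPI_Intercomm_create with barriers and a clean disconnect.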