Hi Ralph,
Is it always the case that a proc launched local to mpirun uses mpirun
as the daemon? Our engine is not local to the host that mpirun is on,
it just happens to send the task back to that same host, and the grid
engine system handles all the process starting. If it is the case, is
there any particular flag or option I should be using with the orted on
the local host to indicate that it is local? Should I even be starting
an orted in this case, and if not, how would I start the proc? Also,
would it be safe to always decrease by one the maximum vpid used with
the orteds for the other tasks?
Thanks,
Maury
On 02/03/12 11:24, Ralph Castain wrote:
No brilliant suggestion - it sounds like your plugin isn't accurately computing
the number of daemons. When a proc is launched local to mpirun, it uses mpirun
as the daemon - it doesn't start another daemon on the same node. If your
plugin is doing so, or you are computing an extra daemon vpid that doesn't
truly exist, then you will have problems.
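
To make the counting rule above concrete, here is a minimal sketch in C. It is not code from Open MPI or from the plugin: the function name assign_daemon_vpids, the host-name strings, and the "one orted per distinct remote host" policy are all assumptions made for illustration. The only points taken from the thread are that a task on mpirun's own host is served by mpirun itself (treated here as vpid 0) and that no orted vpid should be handed out that does not correspond to a daemon actually started.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Open MPI): for each task host, record the
 * vpid of the daemon that will serve it.
 *   - tasks on mpirun's own host get vpid 0 (mpirun acts as their daemon)
 *   - each distinct remote host gets one orted, numbered 1..N with no gaps
 * Returns N, the number of orteds that actually need to be started.
 */
size_t assign_daemon_vpids(const char *hosts[], size_t n,
                           const char *mpirun_host, unsigned vpids[])
{
    size_t num_orteds = 0;

    assert(mpirun_host != NULL);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(hosts[i], mpirun_host) == 0) {
            vpids[i] = 0;               /* local task: no extra daemon */
            continue;
        }

        /* Reuse the vpid of an earlier task on the same remote host. */
        size_t j;
        for (j = 0; j < i; j++) {
            if (strcmp(hosts[j], hosts[i]) == 0) {
                break;
            }
        }

        if (j < i) {
            vpids[i] = vpids[j];
        } else {
            vpids[i] = (unsigned)++num_orteds;  /* first task on this host */
        }
    }
    return num_orteds;
}
```

With this scheme the largest vpid handed to any orted always equals the returned count, so there is never a vpid that points one slot past the daemons that were really launched.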
On Feb 3, 2012, at 11:27 AM, Maurice Feskanich wrote:
Hi Folks,
I'm having a problem with running mpirun when one of the tasks winds up running
on the same machine as mpirun.
A little background: our system uses a plugin to send tasks to grid engine. We
are currently using version 1.3.4 (we are not able to move to a newer version
because of the requirements of the tools that use our system). Our code runs
on Solaris (both SPARC and x86) and Linux.
What we are seeing is that mpirun sometimes gets a segmentation violation at
line 342 of plm_base_launch_support.c:
pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
Investigation has found that mev->sender.vpid is one greater than the number
of non-NULL elements in the pdatorted array.
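
As an illustration of why that off-by-one value crashes, here is a small self-contained sketch of the same kind of lookup with a guard in front of it. The types and names (proc_t, vpid_t, mark_daemon_running, PROC_STATE_RUNNING) are hypothetical stand-ins, not the real ORTE definitions, and this is not a proposed patch to plm_base_launch_support.c; it only shows that a vpid which is out of range, or which names an empty slot, has to be rejected before the pointer is dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real ORTE types (assumed for illustration). */
typedef uint32_t vpid_t;
typedef struct { int state; } proc_t;

enum { PROC_STATE_UNDEF = 0, PROC_STATE_RUNNING = 1 };

/*
 * Mark the daemon with the given vpid as running.
 * The array is indexed directly by vpid, so a vpid equal to or beyond
 * num_slots, or one whose slot is still NULL, cannot be dereferenced.
 * Returns 0 on success, -1 if the vpid does not name a populated slot.
 */
int mark_daemon_running(proc_t **daemons, size_t num_slots, vpid_t vpid)
{
    assert(daemons != NULL);

    if ((size_t)vpid >= num_slots || daemons[vpid] == NULL) {
        return -1;  /* report from a daemon the launcher never accounted for */
    }
    daemons[vpid]->state = PROC_STATE_RUNNING;
    return 0;
}
```

A guard like this only turns the crash into a reportable error; the underlying mismatch between the vpids given to the orteds and the daemons actually recorded still has to be fixed at the source.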
Here is the dbx stacktrace:
t@1 (l@1) program terminated by signal SEGV (no mapping at the fault address)
Current function is process_orted_launch_report (optimized)
342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(dbx) where
current thread: t@1
=>[1] process_orted_launch_report(fd = ???, opal_event = ???, data = ???) (optimized), at
0xffffffff7f491e60 (line ~342) in "plm_base_launch_support.c"
[2] event_process_active(base = ???) (optimized), at 0xffffffff7f241d04 (line ~651) in
"event.c"
[3] opal_event_base_loop(base = ???, flags = ???) (optimized), at 0xffffffff7f242178
(line ~823) in "event.c"
[4] opal_event_loop(flags = ???) (optimized), at 0xffffffff7f241f98 (line ~730) in
"event.c"
[5] opal_progress() (optimized), at 0xffffffff7f21d484 (line ~189) in
"opal_progress.c"
[6] orte_plm_base_daemon_callback(num_daemons = ???) (optimized), at 0xffffffff7f492388
(line ~459) in "plm_base_launch_support.c"
[7] orte_plm_dream_spawn(0x8f0ac, 0x478560, 0x47868c, 0x12c, 0xffffffff7d305198,
0x8a8c0000), at 0xffffffff7d304a5c
[8] orterun(argc = 11, argv = 0xffffffff7fffede8), line 748 in "orterun.c"
[9] main(argc = 11, argv = 0xffffffff7fffede8), line 13 in "main.c"
The vpids we use when we start the orteds are 1-based, but the pdatorted array
is zero-based.
Any help anyone can provide would be greatly appreciated. Please don't hesitate
to ask questions.
Thanks,
Maury Feskanich
Oracle, Inc.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel