Thank you Ralph for this valuable help.

I set up a quick-and-dirty patch that basically postpones process_msg (and hence daemon_collective) until the launch is done: in process_msg, I requeue a new process_msg event and return.

In this "everything-must-be-non-blocking-and-driven-through-opal_progress" design, I don't think blocking calls like the one in daemon_collective should be allowed. The same applies to the blocking call in send_relay. [Well, actually, one is okay; two may lead to an interlock.]

If you have time to write a nicer patch, that would be great and I would be happy to test it. Otherwise, I will try to implement your idea properly next week (with my limited knowledge of orted).

For the record, here is the patch I am currently testing at large scale:
diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
--- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
+++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
@@ -687,14 +687,6 @@
         opal_list_append(&orte_local_jobdata, &jobdat->super);
     }
-    /* it may be possible to get here prior to having actually finished processing our
-     * local launch msg due to the race condition between different nodes and when
-     * they start their individual procs. Hence, we have to first ensure that we
-     * -have- finished processing the launch msg, or else we won't know whether
-     * or not to wait before sending this on
-     */
-    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
-
     /* unpack the collective type */
     n = 1;
     if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type,
                                               &n, ORTE_GRPCOMM_COLL_T))) {
@@ -894,6 +886,28 @@
     proc = &mev->sender;
     buf = mev->buffer;
+
+    jobdat = NULL;
+    for (item = opal_list_get_first(&orte_local_jobdata);
+         item != opal_list_get_end(&orte_local_jobdata);
+         item = opal_list_get_next(item)) {
+        jobdat = (orte_odls_job_t*)item;
+
+        /* is this the specified job? */
+        if (jobdat->jobid == proc->jobid) {
+            break;
+        }
+    }
+    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
+        /* it may be possible to get here prior to having actually finished processing our
+         * local launch msg due to the race condition between different nodes and when
+         * they start their individual procs. Hence, we have to first ensure that we
+         * -have- finished processing the launch msg. Requeue this event until it is done.
+         */
+        int tag = mev->tag;
+        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
+        return;
+    }
     /* is the sender a local proc, or a daemon relaying the collective? */
     if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
Sylvain
On Thu, 19 Nov 2009, Ralph Castain wrote:
Very strange. As I said, we routinely launch jobs spanning several
hundred nodes without problem. You can see the platform files for that
setup in contrib/platform/lanl/tlcc
That said, it is always possible you are hitting some kind of race
condition we don't hit. In looking at the code, one possibility would be
to make all the communications flow through the daemon cmd processor in
orte/orted_comm.c. This is the way it used to work until I reorganized
the code a year ago for other reasons that never materialized.
Unfortunately, the daemon collective has to wait until the local launch
cmd has been completely processed so it can know whether or not to wait
for contributions from local procs before sending along the collective
message, so this kinda limits our options.
About the only other thing you could do would be to not send the relay
at all until -after- processing the local launch cmd. You can then
remove the "wait" in the daemon collective as you will know how many
local procs are involved, if any.
I used to do it that way and it guarantees it will work. The negative is
that we lose some launch speed as the next nodes in the tree don't get
the launch message until this node finishes launching all its procs.
The way around that, of course, would be to:
1. process the launch message, thus extracting the number of any local
procs and setting up all data structures...but do -not- launch the procs
at this time (as this is what takes all the time)
2. send the relay - the daemon collective can now proceed without a
"wait" in it
3. now launch the local procs
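To make that ordering concrete, here is a minimal standalone sketch of the 1-2-3 flow. It is not the actual ORTE code: every type and function in it (including the send_relay stub) is a stand-in invented for illustration; the real logic lives in orte/orted/orted_comm.c and the orte/mca/odls framework.

#include <stdio.h>

typedef struct { int num_local_procs; } launch_msg_t;  /* stand-in for the real launch buffer */

static int num_local_procs = -1;   /* -1 means "launch msg not processed yet" */

/* step 1: unpack the launch msg and build the job/child data structures (fast) */
static void construct_local_child_list(const launch_msg_t *msg)
{
    num_local_procs = msg->num_local_procs;
    printf("jobdat ready: expecting %d local procs\n", num_local_procs);
}

/* step 2: forward the launch msg to the next daemons in the tree (stub) */
static void send_relay(const launch_msg_t *msg)
{
    (void)msg;
    printf("relay sent; daemon collectives no longer need to wait\n");
}

/* step 3 (the slow part): fork/exec the local procs */
static void fork_and_exec_local_children(void)
{
    printf("launching %d local procs\n", num_local_procs);
}

int main(void)
{
    launch_msg_t msg = { .num_local_procs = 4 };
    construct_local_child_list(&msg);  /* 1. know how many local procs exist     */
    send_relay(&msg);                  /* 2. relay immediately, no "wait" needed */
    fork_and_exec_local_children();    /* 3. do the time-consuming launch last   */
    return 0;
}

The only information the daemon collective needs (the number of local procs) is available after step 1, so the relay can go out before any fork/exec happens.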
It would be a fairly simple reorganization of the code in the
orte/mca/odls area. I can do it this weekend if you like, or you can do
it - either way is fine, but if you do it, please contribute it back to
the trunk.
Ralph
On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
I would say I use the default settings, i.e. I don't set anything "special" at configure time.
I'm launching my processes with SLURM (salloc + mpirun).
Sylvain
On Wed, 18 Nov 2009, Ralph Castain wrote:
How did you configure OMPI?
What launch mechanism are you using - ssh?
On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
I don't think so, and at least I'm not doing it explicitly. How do I know?
Sylvain
On Tue, 17 Nov 2009, Ralph Castain wrote:
We routinely launch across thousands of nodes without a problem...I have never
seen it stick in this fashion.
Did you build and/or are using ORTE threaded by any chance? If so, that
definitely won't work.
On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
Hi all,
We are currently experiencing problems at launch time on the 1.5 branch with a relatively large number of nodes (at least 80). Some processes are not spawned and the orted processes are deadlocked.

When MPI processes call MPI_Init before send_relay is complete, the send_relay and daemon_collective functions interlock nicely. Here is the scenario:

send_relay
  performs the tree send:
    orte_rml_oob_send_buffer
      orte_rml_oob_send
        > opal_condition_wait
          waiting for the send to complete, thus calling opal_progress()
        > opal_progress()
          but since a collective request arrived from the network, we enter:
        > daemon_collective
          which waits for the job to be initialized (spinning on
          jobdat->launch_msg_processed) before continuing, thus calling:
        > opal_progress()

At this point the send may complete, but since we never return to orte_rml_oob_send, we never perform the launch (and thus never set jobdat->launch_msg_processed to 1).
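To make the re-entrancy concrete, here is a small standalone program (not ORTE code; names like fake_progress and send_relay_like are invented for the illustration) that reproduces the same shape of hang: a blocking wait that drives a progress loop, and a handler dispatched from that loop which itself blocks on a flag that can only be set after the outer wait returns.

#include <stdio.h>
#include <stdlib.h>

static int send_complete        = 0;  /* completion of the relayed send           */
static int launch_msg_processed = 0;  /* set only after send_relay_like() returns */
static int collective_pending   = 1;  /* a daemon collective arrived early        */

static void daemon_collective_like(void);

/* stands in for opal_progress(): completes the send and dispatches queued events */
static void fake_progress(void)
{
    send_complete = 1;
    if (collective_pending) {
        collective_pending = 0;
        daemon_collective_like();   /* handler re-entered from the progress loop */
    }
}

/* stands in for daemon_collective(): blocks until the launch msg is processed */
static void daemon_collective_like(void)
{
    int spins = 0;
    while (!launch_msg_processed) {
        if (++spins > 1000000) {    /* bounded here only so the demo terminates;
                                       the real code spins forever                */
            fprintf(stderr, "deadlock: the code that sets launch_msg_processed "
                            "sits below us on the stack and can never resume\n");
            exit(1);
        }
        fake_progress();
    }
}

/* stands in for send_relay() calling orte_rml_oob_send(): a blocking send */
static void send_relay_like(void)
{
    while (!send_complete) {
        fake_progress();            /* control never comes back once the handler blocks */
    }
    launch_msg_processed = 1;       /* unreachable while daemon_collective_like spins */
}

int main(void)
{
    send_relay_like();
    puts("never reached");
    return 0;
}

In the real orted, the inner spin is the ORTE_PROGRESSED_WAIT on jobdat->launch_msg_processed and the outer one is the condition wait inside orte_rml_oob_send, as the gdb stack below shows.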
I may try to solve the bug myself (this is quite a high-priority problem for me), but people who are more familiar with orted than I am may be able to propose a nicer and cleaner solution...

For those who like real (and complete) gdb stacks, here is the backtrace of a stuck orted:
#0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
#1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
#2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
#3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#4  0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
#6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
#7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
#9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
#12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
#13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
#15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
#18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
#19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
#21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
#24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
#25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
#26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
#27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
#28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
#30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
#31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
#32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
#33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
Thanks in advance,
Sylvain
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel