Hi Ralph,
I'm also puzzled :-)
Here is what I did today :
* download the latest nightly build (openmpi-1.7a1r22241)
* untar it
* patch it with my "ORTE_RELAY_DELAY" patch
* build it directly on the cluster (running FC11) with :
./configure --platform=contrib/platform/lanl/tlcc/debug-nopanasas \
--prefix=<some path in my home>
make && make install
* deactivate oob_tcp_if_include=ib0 in openmpi-mca-params.conf (IPoIB is
broken on my machine) and run with :
salloc -N 10 mpirun ./helloworld
And .. still the same behaviour : ok by default, deadlock with the typical
stack when setting ORTE_RELAY_DELAY to 1.
About my previous e-mail, I was wrong about all components having a 0
priority : it was based on default parameters reported by "ompi_info -a |
grep routed". It seems that the truth is not always in ompi_info ...
Sylvain
On Fri, 27 Nov 2009, Ralph Castain wrote:
On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:
Hi Ralph,
I tried with the trunk and it makes no difference for me.
Strange
Looking at potential differences, I found out something strange. The bug may have
something to do with the "routed" framework. I can reproduce the bug with
binomial and direct, but not with cm and linear (you disabled the build of the latter in
your configure options -- why ?).
You won't with cm because there is no relay. Likewise, direct doesn't have a
relay - so I'm really puzzled how you can see this behavior when using the
direct component???
I disable components in my build to save memory. Every component we open costs
us memory that may or may not be recoverable during the course of execution.
Btw, all components have a 0 priority and none is defined to be the default
component. Which one is the default then ? binomial (as the first in
alphabetical order) ?
I believe you must have a severely corrupted version of the code. The binomial
component has priority 70 so it will be selected as the default.
Linear has priority 40, though it will only be selected if you say ^binomial.
CM and radix have special selection code in them so they will only be selected
when specified.
Direct and slave have priority 0 to ensure they will only be selected when
specified
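
The selection rules described above can be sketched as a small toy model. This is not the MCA selection code: the struct, the table and toy_select() are hypothetical names made up for illustration, and the priority given to cm and radix is a placeholder, since only their "explicit only" behaviour is stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a routed component entry (not an ORTE type). */
typedef struct {
    const char *name;
    int priority;       /* higher wins when nothing is requested */
    int explicit_only;  /* only picked when named explicitly */
} toy_component_t;

static const toy_component_t toy_routed[] = {
    {"binomial", 70, 0},
    {"linear",   40, 0},
    {"direct",    0, 0},
    {"slave",     0, 0},
    {"cm",        0, 1},  /* priority is a placeholder, not from the source */
    {"radix",     0, 1},  /* priority is a placeholder, not from the source */
};

/* requested: NULL for the default, "name" to force one component,
 * "^name" to exclude one component. Returns NULL if nothing matches. */
static const char *toy_select(const char *requested)
{
    const size_t n = sizeof(toy_routed) / sizeof(toy_routed[0]);
    const toy_component_t *best = NULL;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        const toy_component_t *c = &toy_routed[i];
        if (requested != NULL && requested[0] != '^') {
            /* explicit request: only an exact name match is acceptable */
            if (strcmp(requested, c->name) == 0) {
                return c->name;
            }
            continue;
        }
        if (c->explicit_only) {
            continue;  /* never a default candidate */
        }
        if (requested != NULL && strcmp(requested + 1, c->name) == 0) {
            continue;  /* excluded with ^name */
        }
        if (best == NULL || c->priority > best->priority) {
            best = c;
        }
    }
    return best ? best->name : NULL;
}
```

With this table, toy_select(NULL) picks binomial and toy_select("^binomial") falls back to linear, which matches the behaviour described above.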
Can you check which one you are using and try with binomial explicitly chosen ?
I am using binomial for all my tests
From what you are describing, I think you either have a corrupted copy of the
code, are picking up mis-matched versions, or something strange as your
experiences don't match what anyone else is seeing.
Remember, the phase you are discussing here has nothing to do with the native launch
environment. This is dealing with the relative timing of the application launch versus
relaying the launch message itself - i.e., the daemons are already up and running before
any of this starts. Thus, this "problem" has nothing to do with how we launch
the daemons. So, if it truly were a problem in the code, we would see it on every
environment - torque, slurm, ssh, etc.
We routinely launch jobs spanning hundreds to thousands of nodes without
problem. If this timing problem was as you have identified, then we would see
this constantly. Yet nobody is seeing it, and I cannot reproduce it even with
your reproducer.
I honestly don't know what to suggest at this point. Any chance you are picking up
mis-matched OMPI versions on your backend nodes or something? Tried fresh checkouts of
the code? Is this a code base you have modified, or are you seeing this with the
"stock" code from the repo?
Just fishing at this point - can't find anything wrong! :-/
Ralph
Thanks for your time,
Sylvain
On Thu, 26 Nov 2009, Ralph Castain wrote:
Hi Sylvain
Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in
ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the
delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn,
a "hello world" app that calls MPI_Init immediately upon execution.
So I have to conclude this is a problem in your setup/config. Are you sure you
didn't --enable-progress-threads?? That is the only way I can recreate this
behavior.
I plan to modify the relay/message processing method anyway to clean it up. But
there doesn't appear to be anything wrong with the current code.
Ralph
On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
Hi Ralph,
Thanks for your efforts. I will look at our configuration and see how it may
differ from yours.
Here is a patch which helps reproduce the bug even with a small number of
nodes.
diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
+++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
@@ -126,6 +126,13 @@
ORTE_ERROR_LOG(ret);
goto CLEANUP;
}
+ { /* Add delay to reproduce bug */
+ char * str = getenv("ORTE_RELAY_DELAY");
+ int sec = str ? atoi(str) : 0;
+ if (sec) {
+ sleep(sec);
+ }
+ }
}
CLEANUP:
Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
During our experiments, the bug disappeared when we added a delay before
calling MPI_Init. So, configurations where processes are launched slowly or
take some time before MPI_Init should be immune to this bug.
We usually reproduce the bug with one ppn (faster to spawn).
Sylvain
On Thu, 19 Nov 2009, Ralph Castain wrote:
Hi Sylvain
I've spent several hours trying to replicate the behavior you described on
clusters up to a couple of hundred nodes (all running slurm), without success.
I'm becoming increasingly convinced that this is a configuration issue as
opposed to a code issue.
I have enclosed the platform file I use below. Could you compare it to your
configuration? I'm wondering if there is something critical about the config
that may be causing the problem (perhaps we have a problem in our default
configuration).
Also, is there anything else you can tell us about your configuration? How many
ppn triggers it, or do you always get the behavior every time you launch over a
certain number of nodes?
Meantime, I will look into this further. I am going to introduce a "slow down"
param that will force the situation you encountered - i.e., will ensure that the relay is
still being sent when the daemon receives the first collective input. We can then use
that to try and force replication of the behavior you are encountering.
Thanks
Ralph
enable_dlopen=no
enable_pty_support=no
with_blcr=no
with_openib=yes
with_memory_manager=no
enable_mem_debug=yes
enable_mem_profile=no
enable_debug_symbols=yes
enable_binaries=yes
with_devel_headers=yes
enable_heterogeneous=no
enable_picky=yes
enable_debug=yes
enable_shared=yes
enable_static=yes
with_slurm=yes
enable_contrib_no_build=libnbc,vt
enable_visibility=yes
enable_memchecker=no
enable_ipv6=no
enable_mpi_f77=no
enable_mpi_f90=no
enable_mpi_cxx=no
enable_mpi_cxx_seek=no
enable_mca_no_build=pml-dr,pml-crcp2,crcp
enable_io_romio=no
On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
Thank you Ralph for this precious help.
I set up a quick-and-dirty patch basically postponing process_msg (hence
daemon_collective) until the launch is done. In process_msg, I therefore
requeue a process_msg handler and return.
That is basically the idea I proposed, just done in a slightly different place
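
The requeue idea can be illustrated with a toy event loop. This is only a sketch under assumptions: none of the rq_* names exist in ORTE, the queue is a plain array, and the "send" completes as soon as its event is processed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy event queue; every rq_* name is hypothetical, not ORTE code. */
enum { RQ_MAX_EVENTS = 64 };
typedef void (*rq_handler_fn)(void);

static rq_handler_fn rq_queue[RQ_MAX_EVENTS];
static size_t rq_head, rq_tail;
static bool rq_send_complete, rq_launch_done, rq_collective_done;
static int rq_requeues;

static void rq_post(rq_handler_fn fn)
{
    assert(rq_tail < RQ_MAX_EVENTS);
    rq_queue[rq_tail++] = fn;
}

/* Run one pending event; returns false when the queue is empty. */
static bool rq_progress(void)
{
    if (rq_head == rq_tail) {
        return false;
    }
    rq_handler_fn fn = rq_queue[rq_head++];
    fn();
    return true;
}

static void rq_on_send_done(void) { rq_send_complete = true; }

/* Stand-in for process_msg: requeue itself instead of blocking. */
static void rq_process_msg(void)
{
    if (!rq_launch_done) {
        rq_requeues++;
        rq_post(rq_process_msg);  /* try again later, then return at once */
        return;
    }
    rq_collective_done = true;
}

/* Stand-in for send_relay followed by the local launch. */
static void rq_send_relay_then_launch(void)
{
    rq_post(rq_process_msg);   /* collective from a fast peer arrives first */
    rq_post(rq_on_send_done);  /* then the send completion */
    while (!rq_send_complete) {
        rq_progress();
    }
    rq_launch_done = true;
}

/* Returns the number of requeues if the collective completed, else -1. */
int rq_run_demo(void)
{
    rq_head = rq_tail = 0;
    rq_send_complete = rq_launch_done = rq_collective_done = false;
    rq_requeues = 0;
    rq_send_relay_then_launch();
    while (rq_progress()) {
        /* drain whatever is still queued */
    }
    return rq_collective_done ? rq_requeues : -1;
}
```

Because the handler returns instead of waiting, the stack unwinds back to the send loop and the launch can finish. In a real event loop a handler that requeues itself immediately may spin, so some back-off would probably be needed.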
In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I
don't think that blocking calls like the one in daemon_collective should be allowed. This
also applies to the blocking one in send_relay. [Well, actually, one is okay, 2 may lead
to interlocking.]
Well, that would be problematic - you will find "progressed_wait" used
repeatedly in the code. Removing them all would take a -lot- of effort and a major
rewrite. I'm not yet convinced it is required. There may be something strange in how you
are set up, or in your cluster - like I said, this is the first report of a problem we have
had, and people with much bigger slurm clusters have been running this code every day for
over a year.
If you have time to do a nicer patch, it would be great and I would be happy to
test it. Otherwise, I will try to implement your idea properly next week (with
my limited knowledge of orted).
Either way is fine - I'll see if I can get to it.
Thanks
Ralph
For the record, here is the patch I'm currently testing at large scale :
diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
--- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
+++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
@@ -687,14 +687,6 @@
opal_list_append(&orte_local_jobdata, &jobdat->super);
}
- /* it may be possible to get here prior to having actually finished processing our
- * local launch msg due to the race condition between different nodes and when
- * they start their individual procs. Hence, we have to first ensure that we
- * -have- finished processing the launch msg, or else we won't know whether
- * or not to wait before sending this on
- */
- ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
-
/* unpack the collective type */
n = 1;
if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
@@ -894,6 +886,28 @@
proc = &mev->sender;
buf = mev->buffer;
+
+ jobdat = NULL;
+ for (item = opal_list_get_first(&orte_local_jobdata);
+ item != opal_list_get_end(&orte_local_jobdata);
+ item = opal_list_get_next(item)) {
+ jobdat = (orte_odls_job_t*)item;
+
+ /* is this the specified job? */
+ if (jobdat->jobid == proc->jobid) {
+ break;
+ }
+ }
+ if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
+ /* it may be possible to get here prior to having actually finished processing our
+ * local launch msg due to the race condition between different nodes and when
+ * they start their individual procs. Hence, we have to first ensure that we
+ * -have- finished processing the launch msg. Requeue this event until it is done.
+ */
+ int tag = mev->tag;
+ ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
+ return;
+ }
/* is the sender a local proc, or a daemon relaying the collective? */
if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
Sylvain
On Thu, 19 Nov 2009, Ralph Castain wrote:
Very strange. As I said, we routinely launch jobs spanning several hundred
nodes without problem. You can see the platform files for that setup in
contrib/platform/lanl/tlcc
That said, it is always possible you are hitting some kind of race condition we
don't hit. In looking at the code, one possibility would be to make all the
communications flow through the daemon cmd processor in orte/orted/orted_comm.c. This
is the way it used to work until I reorganized the code a year ago for other
reasons that never materialized.
Unfortunately, the daemon collective has to wait until the local launch cmd has
been completely processed so it can know whether or not to wait for
contributions from local procs before sending along the collective message, so
this kinda limits our options.
About the only other thing you could do would be to not send the relay at all until
-after- processing the local launch cmd. You can then remove the "wait" in the
daemon collective as you will know how many local procs are involved, if any.
I used to do it that way and it guarantees it will work. The negative is that
we lose some launch speed as the next nodes in the tree don't get the launch
message until this node finishes launching all its procs.
The way around that, of course, would be to:
1. process the launch message, thus extracting the number of any local procs
and setting up all data structures...but do -not- launch the procs at this time
(as this is what takes all the time)
2. send the relay - the daemon collective can now proceed without a "wait" in it
3. now launch the local procs
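
The three steps above can be sketched as follows. This is a toy model, not the odls code: toy_job_t and the toy_* functions are hypothetical, and each step only records its name so the ordering can be checked.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical per-job state; not the real orte_odls_job_t. */
typedef struct {
    int num_local_procs;
    int launch_msg_processed;
    char trace[64];  /* records the order in which the steps ran */
} toy_job_t;

static void toy_note(toy_job_t *job, const char *step)
{
    assert(strlen(job->trace) + strlen(step) + 2 < sizeof(job->trace));
    if (job->trace[0] != '\0') {
        strcat(job->trace, ">");
    }
    strcat(job->trace, step);
}

/* Step 1: extract the local proc count and set up data, but do not fork. */
static void toy_parse_launch_msg(toy_job_t *job, int nprocs)
{
    job->num_local_procs = nprocs;
    job->launch_msg_processed = 1;
    toy_note(job, "parse");
}

/* Step 2: forward the launch message down the tree. Any collective that
 * arrives from here on already sees launch_msg_processed == 1. */
static void toy_send_relay(toy_job_t *job)
{
    assert(job->launch_msg_processed == 1);
    toy_note(job, "relay");
}

/* Step 3: the slow part, starting the local procs. */
static void toy_launch_local_procs(toy_job_t *job)
{
    toy_note(job, "launch");
}

static void toy_handle_launch_cmd(toy_job_t *job, int nprocs)
{
    toy_parse_launch_msg(job, nprocs);
    toy_send_relay(job);
    toy_launch_local_procs(job);
}
```

The point of the ordering is that the relay no longer overlaps with an unprocessed launch message, while the next nodes in the tree still get the message before this node spends time forking its procs.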
It would be a fairly simple reorganization of the code in the orte/mca/odls
area. I can do it this weekend if you like, or you can do it - either way is
fine, but if you do it, please contribute it back to the trunk.
Ralph
On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
I would say I use the default settings, i.e. I don't set anything "special" at
configure.
I'm launching my processes with SLURM (salloc + mpirun).
Sylvain
On Wed, 18 Nov 2009, Ralph Castain wrote:
How did you configure OMPI?
What launch mechanism are you using - ssh?
On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
I don't think so, and I'm not doing it explicitly at least. How do I know ?
Sylvain
On Tue, 17 Nov 2009, Ralph Castain wrote:
We routinely launch across thousands of nodes without a problem...I have never
seen it stick in this fashion.
Did you build and/or are you using ORTE threaded by any chance? If so, that
definitely won't work.
On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
Hi all,
We are currently experiencing problems at launch on the 1.5 branch on a
relatively large number of nodes (at least 80). Some processes are not spawned
and orted processes are deadlocked.
When MPI processes are calling MPI_Init before send_relay is complete, the
send_relay function and the daemon_collective function are doing a nice
interlock :
Here is the scenario :
send_relay
performs the send tree :
orte_rml_oob_send_buffer
orte_rml_oob_send
opal_condition_wait
Waiting on completion from send thus calling opal_progress()
opal_progress()
But since a collective request arrived from the network, entered :
daemon_collective
However, daemon_collective is waiting for the job to be initialized (wait on
jobdat->launch_msg_processed) before continuing, thus calling :
opal_progress()
At this time, the send may complete, but since we will never go back to
orte_rml_oob_send, we will never perform the launch (setting
jobdat->launch_msg_processed to 1).
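
The scenario can be reproduced in miniature with a toy single-threaded event loop. This is a sketch under assumptions, not ORTE code: all il_* names are hypothetical, and the inner wait gives up after a fixed number of spins only so that the demo terminates (the real wait has no such bound).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the interlock; every il_* name is hypothetical. */
enum { IL_MAX_EVENTS = 8, IL_MAX_SPINS = 1000 };
typedef void (*il_handler_fn)(void);

static il_handler_fn il_queue[IL_MAX_EVENTS];
static size_t il_head, il_tail;
static bool il_send_complete, il_launch_msg_processed;
static bool il_stuck_after_send_done;

static void il_post(il_handler_fn fn)
{
    assert(il_tail < IL_MAX_EVENTS);
    il_queue[il_tail++] = fn;
}

/* Run at most one pending event, like one pass of a progress loop. */
static void il_progress(void)
{
    if (il_head < il_tail) {
        il_handler_fn fn = il_queue[il_head++];
        fn();
    }
}

static void il_on_send_done(void) { il_send_complete = true; }

/* Stand-in for daemon_collective: waits for the launch msg by spinning
 * on progress. The spin bound exists only to let the demo finish. */
static void il_daemon_collective(void)
{
    for (int spins = 0; !il_launch_msg_processed; spins++) {
        if (spins >= IL_MAX_SPINS) {
            /* the send has completed, yet nobody can set the flag */
            il_stuck_after_send_done = il_send_complete;
            return;
        }
        il_progress();
    }
}

/* Stand-in for send_relay followed by the local launch. */
static void il_send_relay(void)
{
    il_post(il_daemon_collective);  /* collective arrives from the network */
    il_post(il_on_send_done);       /* then the send completion */
    while (!il_send_complete) {
        il_progress();              /* enters il_daemon_collective */
    }
    il_launch_msg_processed = true; /* only reachable after unwinding */
}

/* Returns 1 if the nested wait was stuck even though the send completed. */
int il_run_demo(void)
{
    il_head = il_tail = 0;
    il_send_complete = il_launch_msg_processed = false;
    il_stuck_after_send_done = false;
    il_send_relay();
    return il_stuck_after_send_done ? 1 : 0;
}
```

The flag is set by the outer frame, but the outer frame can only resume once the inner wait returns, and the inner wait returns only once the flag is set.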
I may try to solve the bug (this is quite a top priority problem for me), but
maybe people who are more familiar with orted than I am may propose a nice and
clean solution ...
For those who like real (and complete) gdb stacks, here they are :
#0 0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
#1 0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0,
tv=0x7fff0d977880) at poll.c:167
#2 0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at
event.c:823
#3 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#4 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#5 0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at
grpcomm_bad_module.c:696
#6 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at
grpcomm_bad_module.c:901
#7 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#8 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at
event.c:839
#9 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at
grpcomm_bad_module.c:696
#12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at
grpcomm_bad_module.c:901
#13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at
event.c:839
#15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at
grpcomm_bad_module.c:696
#18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at
grpcomm_bad_module.c:901
#19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at
event.c:839
#21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at
../../../../opal/threads/condition.h:99
#24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0,
iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
#25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0,
buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
#26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at
orted/orted_comm.c:127
#27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1,
data=0x965fc0) at orted/orted_comm.c:308
#28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at
event.c:839
#30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
#31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
#32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at
orted/orted_main.c:769
#33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
Thanks in advance,
Sylvain
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel