Eugene,
I got some free time (yee haw) and took a look at the OB1 PML in
order to fix the issue. I think I found the problem, since I'm no
longer able to reproduce this error. Can you please give it a try
with r20946 and r20947, but without r20944?
Thanks,
george.
On Apr 6, 2009, at 14:49, Eugene Loh wrote:
This strikes me as very reasonable. That is, the PML should be
fixed, but to keep the issue from being a 1.3.2 blocker we should
bump the mpool_sm_min_size default back up again so that 1.3.2 is no
worse than 1.3.1.
I put back SVN r20944 with this change. osu_bw now runs (for me).
I filed CMR 1870 to add this change to the 1.3.2 branch. I guess I
need a code review. Could someone review the code for r20944 and
annotate the CMR? It's a one-line/several-character change that
bumps the default mpool_sm_min_size from 0 to 64M.
At this point, I assume 1857 is no longer a blocker, but in the long
term the PML should be fixed.
Lenny Verkhovsky wrote:
Changing the default value is an easy fix. It will not add new
possible bugs/deadlocks/paths where no one has gone before at the
PML level.
This fix can be added to Open MPI 1.3, which is currently blocked
due to the OSU failure.
The PML fix can be done later (IMHO).
On Sat, Apr 4, 2009 at 1:46 AM, Eugene Loh <eugene....@sun.com>
wrote:
What's next on this ticket? It's supposed to be a blocker. Again,
the issue is that osu_bw deluges a receiver with rendezvous
messages, but the receiver does not have enough eager frags to
acknowledge them all. We see this now that the sizing of the mmap
file has changed and there's less headroom to grow the free lists.
Possible fixes are:
A) Just make the mmap file default size larger (though with less
overkill than we used to have).
B) Fix the PML code that is supposed to deal with cases like this.
(At least I think the PML has code that's intended for this purpose.)
Eugene Loh wrote:
In osu_bw, process 0 pumps lots of Isends at process 1, and
process 1 in turn posts lots of matching Irecvs. Many messages
are in flight. The question is what happens when resources are
exhausted and OMPI cannot handle so much in-flight traffic. Let's
specifically consider the case of long, rendezvous messages. There
are at least two situations; a minimal sketch of the traffic
pattern itself follows them below.
1) When the sender no longer has any fragments (nor can it grow its
free list any more), it queues the send up with
add_request_to_send_pending() and somehow life is good. The PML
seems to handle this case "correctly".
2) When the receiver -- specifically
mca_pml_ob1_recv_request_ack_send_btl() -- no longer has any
fragments with which to send ACKs back to confirm readiness for
rendezvous, the resource-exhaustion signal travels up the call
stack to mca_pml_ob1_recv_request_ack_send(), which does a
MCA_PML_OB1_ADD_ACK_TO_PENDING(). In short, the PML adds the ACK
to pckt_pending. Somehow, this code path doesn't work.
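For concreteness, here is a minimal sketch of the traffic pattern I
mean (this is not osu_bw itself; the window and message sizes are
arbitrary placeholders, picked only so that the messages go through
the rendezvous path): process 0 pumps a window of nonblocking sends
at process 1, process 1 posts the matching nonblocking receives, and
both wait.

    #include <mpi.h>
    #include <stdlib.h>

    #define WINDOW 64              /* number of in-flight messages (placeholder) */
    #define MSGLEN (256 * 1024)    /* placeholder size, intended to exceed the eager limit */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf;
        MPI_Request reqs[WINDOW];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc((size_t)WINDOW * MSGLEN);

        if (rank == 0) {
            /* sender: many long Isends, all outstanding at once */
            for (int i = 0; i < WINDOW; i++)
                MPI_Isend(buf + (size_t)i * MSGLEN, MSGLEN, MPI_CHAR,
                          1, 0, MPI_COMM_WORLD, &reqs[i]);
        } else if (rank == 1) {
            /* receiver: the matching Irecvs */
            for (int i = 0; i < WINDOW; i++)
                MPI_Irecv(buf + (size_t)i * MSGLEN, MSGLEN, MPI_CHAR,
                          0, 0, MPI_COMM_WORLD, &reqs[i]);
        }
        if (rank < 2)
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Each of those long messages needs the receiver to send an ACK back
before the data can flow, which is exactly where situation 2 above
bites once the receiver runs out of eager fragments.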
The reason we see the problem now is that I added "autosizing" of
the shared-memory area. We used to mmap *WAY* too much shared
memory for small-np jobs. (Yes, that's a subjective statement.)
Meanwhile, at large np, we didn't mmap enough and jobs wouldn't
start. (Objective statement there.) So, I added heuristics to
size the shared area "appropriately". The heuristics basically
targeted the needs of MPI_Init(). If you want fragment free lists
to grow on demand after MPI_Init(), you now basically have to bump
mpool_sm_min_size up explicitly.
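If you just want the old headroom back in the meantime, the
parameter can be bumped at run time with something like

    mpirun --mca mpool_sm_min_size 67108864 -np 2 ./osu_bw

where 67108864 is 64M, assuming I'm remembering correctly that the
value is given in bytes.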
I'd like feedback on a fix. Here are two options:
A) Someone (could be I) increases the default resources. E.g., we
could start with a larger eager free list. Or, I could change
those "heuristics" to allow some amount of headroom for free lists
to grow on demand. Either way, I'd appreciate feedback on how big
to set these things.
B) Someone (not I, since I don't know how) fixes the ob1 PML to
handle scenario 2 above correctly.
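For whoever looks at option B: below is a generic sketch of the
defer-and-retry pattern that scenario 2 is supposed to implement.
This is not the actual ob1 source; the names (try_send_ack,
pending_ack, retry_pending_acks) are made up purely to illustrate
the idea that, when no fragment is available, the ACK gets queued
on a pending list and has to be retried from the progress loop once
fragments are freed.

    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical descriptor for an ACK the receiver still owes a sender. */
    struct pending_ack {
        void               *recv_request;   /* the receive this ACK belongs to */
        struct pending_ack *next;
    };

    static struct pending_ack *pending_head = NULL; /* analogous role to pckt_pending */
    static int free_frags = 0;              /* pretend the eager free list is exhausted */

    /* Stand-in for "grab an eager fragment and send the ACK";
     * fails when the free list is empty and cannot grow. */
    static bool try_send_ack(void *recv_request)
    {
        (void)recv_request;
        if (free_frags == 0)
            return false;                   /* no fragment: caller must defer the ACK */
        free_frags--;
        /* ...fill in the ACK header and hand the fragment to the BTL... */
        return true;
    }

    /* Called at the point where the real code does
     * MCA_PML_OB1_ADD_ACK_TO_PENDING(): remember the ACK for later. */
    static void ack_send_or_defer(void *recv_request)
    {
        if (try_send_ack(recv_request))
            return;                         /* fragment available: ACK went out */
        struct pending_ack *p = malloc(sizeof(*p));
        p->recv_request = recv_request;
        p->next = pending_head;
        pending_head = p;                   /* defer: must be retried later */
    }

    /* Must be driven from the progress loop; retries deferred ACKs as
     * fragments return to the free list.  If deferred ACKs are never
     * retried once a fragment frees up, they sit here forever and the
     * transfer stalls. */
    static void retry_pending_acks(void)
    {
        while (pending_head != NULL && try_send_ack(pending_head->recv_request)) {
            struct pending_ack *done = pending_head;
            pending_head = done->next;
            free(done);
        }
    }

    int main(void)
    {
        int dummy;
        ack_send_or_defer(&dummy);  /* no fragments yet, so the ACK is queued */
        free_frags = 1;             /* a fragment comes back to the free list */
        retry_pending_acks();       /* the deferred ACK finally goes out */
        return 0;
    }

Whatever the real fix looks like, the key invariant is that every
deferred ACK has a guaranteed retry path once resources free up.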
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel