Re: [OMPI devel] trac 1857: SM btl hangs when msg >=4k

George Bosilca Mon, 6 Apr 2009 17:58:10 -0400

Eugene,

I got some free time (yeh haw) and took a look at the OB1 PML in order to fix the issue. I think I found the problem, as I'm now unable to reproduce this error. Can you please give it a try with 20946 and 20947 but without 20944?


  Thanks,
    george.

On Apr 6, 2009, at 14:49 , Eugene Loh wrote:

This strikes me as very reasonable. That is, the PML should be fixed, but to keep the issue from being a 1.3.2 blocker we should bump the mpool_sm_min_size default back up again so that 1.3.2 is no worse than 1.3.1.
I put back SVN r20944 with this change.  osu_bw now runs (for me).
I filed CMR 1870 to add this change to the 1.3.2 branch. I guess I need a code review. Could someone review the code for r20944 and annotate the CMR? It's a one-line/several-character change that bumps the min default size from 0 to 64M.
At this point, I assume 1857 is no longer a blocker, but in the long term the PML should be fixed.
Lenny Verkhovsky wrote:
Changing the default value is an easy fix. This fix will not add new possible bugs/deadlocks/paths where no one has gone before on the PML level. This fix can be added to Open MPI 1.3, which is currently blocked due to the OSU failure.
PML fix can be done later (IMHO)
On Sat, Apr 4, 2009 at 1:46 AM, Eugene Loh <[email protected]> wrote:
What's next on this ticket? It's supposed to be a blocker. Again, the issue is that osu_bw deluges a receiver with rendezvous messages, but the receiver does not have enough eager frags to acknowledge them all. We see this now that the sizing of the mmap file has changed and there's less headroom to grow the free lists. Possible fixes are:
A) Just make the mmap file default size larger (though less overkill than we used to have).
B) Fix the PML code that is supposed to deal with cases like this. (At least I think the PML has code that's intended for this purpose.)
Eugene Loh wrote:
In osu_bw, process 0 pumps lots of Isends to process 1, and process 1 in turn sets up lots of matching Irecvs. Many messages are in flight. The question is what happens when resources are exhausted and OMPI cannot handle so much in-flight traffic. Let's specifically consider the case of long, rendezvous messages. There are at least two situations.
1) When the sender no longer has any fragments (nor can grow its free list any more), it queues a send up with add_request_to_send_pending() and somehow life is good. The PML seems to handle this case "correctly".
2) When the receiver -- specifically mca_pml_ob1_recv_request_ack_send_btl() -- no longer has any fragments to send ACKs back to confirm readiness for rendezvous, the resource-exhaustion signal travels up the call stack to mca_pml_ob1_recv_request_ack_send(), which does a MCA_PML_OB1_ADD_ACK_TO_PENDING(). In short, the PML adds the ACK to pckt_pending. Somehow, this code path doesn't work.
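To make scenario 2 concrete, here is a minimal sketch of the behavior one would expect from a deferred-ACK path: when no fragment is available the ACK is queued, and the queue is drained as soon as a fragment comes back. This is a toy model, not Open MPI code. The names receiver_t, send_ack(), and frag_returned() are hypothetical, and the thread does not say which step of the real pckt_pending path fails.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, NOT Open MPI code: a receiver with a fixed pool
 * of fragments and a FIFO of ACKs that could not be sent yet. */
#define MAX_PENDING 64

typedef struct {
    int    free_frags;            /* fragments currently available      */
    int    pending[MAX_PENDING];  /* request ids whose ACK is deferred  */
    size_t head, count;           /* ring-buffer state for pending[]    */
    int    acks_sent;             /* total ACKs actually sent           */
} receiver_t;

void receiver_init(receiver_t *r, int frags)
{
    r->free_frags = frags;
    r->head = 0;
    r->count = 0;
    r->acks_sent = 0;
}

/* Try to ACK a rendezvous request.
 * Returns 0 if the ACK was sent, 1 if it was deferred,
 * and -1 if the pending queue itself is full. */
int send_ack(receiver_t *r, int req_id)
{
    /* Invariant: deferred ACKs exist only while no fragment is free. */
    assert(r->count == 0 || r->free_frags == 0);

    if (r->free_frags > 0) {
        r->free_frags--;          /* the ACK consumes one fragment */
        r->acks_sent++;
        return 0;
    }
    if (r->count == MAX_PENDING) {
        return -1;
    }
    r->pending[(r->head + r->count) % MAX_PENDING] = req_id;
    r->count++;
    return 1;
}

/* Called when a fragment is handed back to the pool. The important part
 * is the retry loop: without it, deferred ACKs would never go out and
 * the sender would wait forever. */
void frag_returned(receiver_t *r)
{
    r->free_frags++;
    while (r->count > 0 && r->free_frags > 0) {
        r->head = (r->head + 1) % MAX_PENDING;  /* pop the oldest ACK */
        r->count--;
        r->free_frags--;
        r->acks_sent++;
    }
}
```

With two fragments and five incoming rendezvous requests, two ACKs go out immediately and three are deferred; each later call to frag_returned() releases exactly one of the deferred ACKs. A hang of the kind described here is what the model produces if nothing ever triggers the retry.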
The reason we see the problem now is that I added "autosizing" of the shared-memory area. We used to mmap *WAY* too much shared memory for small-np jobs. (Yes, that's a subjective statement.) Meanwhile, at large-np, we didn't mmap enough and jobs wouldn't start. (Objective statement there.) So, I added heuristics to size the shared area "appropriately". The heuristics basically targeted the needs of MPI_Init(). If you want fragment free lists to grow on demand after MPI_Init(), you now basically have to bump mpool_sm_min_size up explicitly.
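As a rough illustration of how a minimum size interacts with autosizing, the sketch below treats the minimum as a floor under a computed estimate. It is an assumption-laden simplification: the function sm_file_size() and the per-process estimate are hypothetical, and the real heuristics are not spelled out in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, NOT the actual Open MPI heuristic.
 * "need" stands in for whatever MPI_Init() is estimated to require;
 * min_size plays the role of mpool_sm_min_size as a lower bound.
 * Overflow of nprocs * per_proc_bytes is ignored for brevity. */
size_t sm_file_size(size_t nprocs, size_t per_proc_bytes, size_t min_size)
{
    size_t need = nprocs * per_proc_bytes;
    return need < min_size ? min_size : need;
}
```

With a floor of 0, a two-process job gets only what the estimate says, leaving no headroom for free lists to grow. With a 64M floor, the same job gets 64M, while a large-np job whose estimate exceeds the floor is unaffected. The parameter can also be raised per run through the MCA mechanism, e.g. "mpirun --mca mpool_sm_min_size 67108864 ...", assuming the value is given in bytes.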
I'd like feedback on a fix.  Here are two options:
A) Someone (could be I) increases the default resources. E.g., we could start with a larger eager free list. Or, I could change those "heuristics" to allow some amount of headroom for free lists to grow on demand. Either way, I'd appreciate feedback on how big to set these things.
B) Someone (not I, since I don't know how) fixes the ob1 PML to handle scenario 2 above correctly.
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
