[OMPI devel] collectives / #1944 progress

Jeff Squyres Wed, 1 Jul 2009 11:52:10 -0400

It looks like Eugene's and George's fixes on coll sm resolve all the known hangs. We still have flow control issues, but that is temporarily being solved by the coll sync component. To be clear: running with coll_sync_barrier_before 1000 seems to resolve all known hangs, and we think that this is good enough for v1.3.3. We should CMR whatever is necessary to the v1.3 branch.

==> We should also default coll_sync_barrier_before to 1000 for v1.3.3 (i.e., ensure sync activates itself).
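
To illustrate why a periodic barrier acts as crude flow control, here is a minimal toy model in C. It is not Open MPI code: the function name max_unexpected_depth and the recv_every parameter are made up for this sketch, and it assumes that a barrier before every Nth collective lets the slow receiver fully drain its unexpected queue.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical, not Open MPI code): a root issues `iters`
 * eager short-message collectives while a slow receiver consumes only
 * one message per `recv_every` sends. Returns the largest unexpected
 * queue depth seen at the receiver.
 *
 * barrier_before == 0 models "no sync"; barrier_before == N models a
 * barrier before every Nth operation, assumed here to drain the queue.
 */
size_t max_unexpected_depth(size_t iters, size_t barrier_before,
                            size_t recv_every)
{
    assert(recv_every > 0);

    size_t depth = 0;
    size_t max_depth = 0;

    for (size_t i = 1; i <= iters; ++i) {
        /* Barrier before every Nth operation: receiver catches up. */
        if (barrier_before > 0 && i % barrier_before == 0) {
            depth = 0;
        }

        /* Root sends eagerly; the message lands in the unexpected queue. */
        ++depth;
        if (depth > max_depth) {
            max_depth = depth;
        }

        /* Slow receiver consumes one message every recv_every sends. */
        if (i % recv_every == 0 && depth > 0) {
            --depth;
        }
    }
    return max_depth;
}
```

In this model, max_unexpected_depth(10000, 1000, 2) never exceeds 1000, while max_unexpected_depth(10000, 0, 2) keeps growing with the iteration count. That is the behavior the sync component relies on: the queue is capped by the barrier interval rather than by the length of the run.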


For the future, we have a two-pronged plan:

1. Clean up the sm btl:
  1a. Remove all dead code.
  1b. Resize free_list_max and fifo_size MCA params to effect good enough flow control.
  1c. Possibly: convert from FIFOs to linked lists (for future maintenance purposes, not necessarily to fix problems).

2. Test, enable, and continue to develop the coll sm module. Using this module will avoid the p2p unexpected message queue explosion that we're seeing (at least for collectives with short messages). It nominally has broadcast, barrier, reduce, and allreduce implemented. We really only need to a) test the heck outta them, and b) add gather, scatter, scan, and exscan to the list. All the other collective operations have implicit synchronization and won't run into the unbounded unexpected queue issues. The bcast loop reproducer seemed to work fine for me with the coll sm, but it segv'ed immediately for Ralph. So clearly some work needs to be done.


We think that these two items should be the main features for 1.3.4.

--
Jeff Squyres
Cisco Systems
