Jeff Squyres wrote:
On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:
> FWIW, George found what looks like a race condition in the sm init
> code today -- it looks like we don't call maffinity anywhere in the
> sm btl startup, so we're not actually guaranteed that the memory is
> local to any particular process(or) (!). This race shouldn't cause
> segvs, though; it should only mean that memory is potentially farther
> away than we intended.
Is this that business that came up recently on one of these mail lists
about setting the memory node to -1 rather than using the value we know
it should be? In mca_mpool_sm_alloc(), I do see a call to
opal_maffinity_base_bind().
No, it was a different thing -- but we missed the call to maffinity
in mpool sm. So that might make George's point moot (I see he still
hasn't chimed in yet on this thread, perhaps that's why ;-) ).
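For the archives, here's roughly what an explicit bind buys you, sketched
with plain libnuma rather than the maffinity framework -- the function and
variable names below are just for illustration, not the actual mpool sm
code:

#include <stddef.h>
#include <numa.h>      /* libnuma; link with -lnuma */

/* Illustration only: ask for this process's slice of an already-mmap'ed
 * shared segment to be placed on the NUMA node we're currently running
 * on, instead of relying on first touch.  my_base/my_len are
 * hypothetical names for this process's portion of the sm segment. */
static int bind_my_portion(void *my_base, size_t my_len)
{
    if (numa_available() < 0) {
        return -1;                 /* no NUMA support; nothing to do */
    }
    numa_setlocal_memory(my_base, my_len);
    return 0;
}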
To throw a little flame on the fire -- I notice the following from an
MTT run last night:
[svbu-mpi004:17172] *** Process received signal ***
[svbu-mpi004:17172] Signal: Segmentation fault (11)
[svbu-mpi004:17172] Signal code: Invalid permissions (2)
[svbu-mpi004:17172] Failing at address: 0x2a98a3f080
[svbu-mpi004:17172] [ 0] /lib64/tls/libpthread.so.0 [0x2a960695b0]
[svbu-mpi004:17172] [ 1] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22619]
[svbu-mpi004:17172] [ 2] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f225ee]
[svbu-mpi004:17172] [ 3] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22946]
[svbu-mpi004:17172] [ 4] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_progress+0xa9) [0x2a95bbc078]
[svbu-mpi004:17172] [ 5] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a95831324]
[svbu-mpi004:17172] [ 6] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a9583185b]
[svbu-mpi004:17172] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e45be]
[svbu-mpi004:17172] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987f160b]
[svbu-mpi004:17172] [ 9] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e4c2e]
[svbu-mpi004:17172] [10] /home/jsquyres/bogus/lib/libmpi.so.0(PMPI_Barrier+0xd7) [0x2a9585987f]
[svbu-mpi004:17172] [11] src/MPI_Type_extent_types_c(main+0xa20) [0x402f88]
[svbu-mpi004:17172] [12] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a9618e3fb]
[svbu-mpi004:17172] [13] src/MPI_Type_extent_types_c [0x4024da]
[svbu-mpi004:17172] *** End of error message ***
Notice the "invalid permissions" message. I didn't notice that
before, but perhaps I wasn't looking.
I also see that this is under coll_tuned, not coll_hierarch (i.e.,
*not* during MPI_INIT -- it's in a barrier).
Yes, actually these happen "a lot". (I've been spending time looking at
IU_Sif/r20880 MTT stack traces.)
If the stack trace has MPI_Init in it, it seems to be going through
mca_coll_hierarch.
Otherwise, the seg fault is in a collective call as you note -- could be
MPI_Allgather, Barrier, Bcast, and I imagine there are others -- then
mca_coll_tuned and eventually down to the sm BTL.
There are also quite a few orphaned(?) stack traces -- just a segfault
and a single-level stack a la
[ 0] /lib/libpthread.so
> The central question is: does "first touch" mean both read and
> write? I.e., is the first process that either reads *or* writes to a
> given location considered "first touch"? Or is it only the first
> write?
So, maybe the strategy is to create the shared area, have each process
initialize its portion (FIFOs and free lists), have all processes sync,
and then move on. That way, you know all memory will be written by the
appropriate owner before it's read by anyone else. First-touch
ownership will be proper and we won't be dependent on zero-filled
pages.
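Something like the following is what I have in mind (just a sketch -- the
segment layout, the control word, and the names are all made up; in the
real code the sync would presumably be the modex or some shared-memory
handshake):

#include <stdint.h>
#include <string.h>

typedef struct {
    volatile uint32_t ready;     /* # of ranks done initializing;
                                  * assumed zeroed by the creator */
} sm_control_t;

void init_my_portion_then_sync(void *seg_base, size_t slice,
                               int my_rank, int nprocs)
{
    sm_control_t *ctrl = (sm_control_t *) seg_base;
    char *mine = (char *) seg_base + sizeof(sm_control_t)
                 + (size_t) my_rank * slice;

    /* First touch: write every byte of my portion (FIFOs, free lists)
     * so the pages fault in local to *me*, regardless of whether a
     * read counts as first touch. */
    memset(mine, 0, slice);

    /* Tell everyone my portion is initialized. */
    __sync_fetch_and_add(&ctrl->ready, 1);

    /* Sync: nobody reads a peer's portion until everyone has written
     * their own.  (Real code would back off / call opal_progress().) */
    while (ctrl->ready < (uint32_t) nprocs) {
        ;
    }
}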
That was what George was getting at yesterday -- there's a section in
the btl sm startup where you're setting up your own FIFOs. But then
there's a section later where you're looking at your peers' FIFOs.
There's no synchronization between these two points -- when you're
looking at your peers' FIFOs, you can tell whether a peer is set up yet
by checking whether its FIFO pointer is NULL. If it's NULL, you loop and
try again (until it's not NULL). This is what George thought might
be "bad" from a maffinity standpoint -- but perhaps this is moot if
mpool sm is calling maffinity bind.
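In other words, the pattern is roughly this (not the actual btl sm
source -- the types, the array bound, and the use of raw pointers
instead of offsets are just to keep the sketch short):

#include <stddef.h>

#define NPROCS_MAX 64                 /* made-up bound for the sketch */

typedef struct fifo_t fifo_t;         /* head / tail / queue, elided */

typedef struct {
    fifo_t * volatile fifo[NPROCS_MAX];  /* each rank publishes its entry */
} sm_ctrl_t;

/* Startup, part 1 (each rank): set up my own FIFO, then publish it. */
void publish_my_fifo(sm_ctrl_t *ctrl, int my_rank, fifo_t *my_fifo)
{
    /* ... initialize my_fifo's head/tail/queue here ... */
    ctrl->fifo[my_rank] = my_fifo;        /* publish */
}

/* Startup, part 2 (each rank): look at a peer's FIFO; if it isn't
 * there yet, loop and try again until it is. */
fifo_t *wait_for_peer_fifo(sm_ctrl_t *ctrl, int peer)
{
    while (NULL == ctrl->fifo[peer]) {
        ;                                 /* not set up yet; try again */
    }
    return ctrl->fifo[peer];
}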
The thing I was wondering about was memory barriers. E.g., you
initialize your FIFO and then post the FIFO pointer; without a barrier,
the other process could see the FIFO pointer before it sees the
initialized memory.
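Concretely, on top of the sketch above, I'd expect we need a write
barrier between initializing the FIFO and publishing the pointer, and a
read barrier between seeing the pointer and touching the FIFO's
contents -- something like this (using the GCC full barrier purely for
illustration; presumably the opal atomic barrier macros are what we'd
actually use):

/* Writer: make the FIFO contents globally visible *before* the
 * pointer that advertises them. */
void publish_my_fifo_safe(sm_ctrl_t *ctrl, int my_rank, fifo_t *my_fifo)
{
    /* ... initialize my_fifo ... */
    __sync_synchronize();            /* write barrier before publish */
    ctrl->fifo[my_rank] = my_fifo;
}

/* Reader: don't read the FIFO's contents until after the load that
 * saw the non-NULL pointer. */
fifo_t *wait_for_peer_fifo_safe(sm_ctrl_t *ctrl, int peer)
{
    fifo_t *f;
    while (NULL == (f = ctrl->fifo[peer])) {
        ;                            /* spin */
    }
    __sync_synchronize();            /* read barrier before using *f */
    return f;
}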
The big question in my mind remains that we don't seem to know how to
reproduce the failure (segv) that we're trying to fix. I, personally,
am reluctant to stick fixes into the code for problems I can't observe.
Well, we *can* observe them -- I can reproduce them at a very low
rate in my MTT runs. We just don't understand the problem yet to
know how to reproduce them manually. To be clear: I'm violently
agreeing with you: I want to fix the problem, but it would be much
mo' betta to *know* that we fixed the problem rather than "well, it
doesn't seem to be happening anymore." :-)