On Mon, Jul 09, 2007 at 10:41:52AM -0400, Tim Prins wrote:
> Gleb Natapov wrote:
> > On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> >> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> >>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> >>>> While looking into another problem I ran into an issue which made ob1
> >>>> segfault on me. Using gm, and running the test test_dan1 in the onesided
> >>>> test suite, if I limit the gm freelist by too much, I get a segfault.
> >>>> That is,
> >>>>
> >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> >>>>
> >>>> works fine, but
> >>>>
> >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >>> I cannot, unfortunately, reproduce this with openib BTL.
> >>>
> >>>> segfaults. Here is the relevant output from gdb:
> >>>>
> >>>> Program received signal SIGSEGV, Segmentation fault.
> >>>> [Switching to Thread 1077541088 (LWP 15600)]
> >>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> >>>>     hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> >>>> 267         MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> >>>>             sizeof(mca_pml_ob1_fin_hdr_t));
> >>> can you send me what's inside bml_btl?
> >> It turns out that the order of arguments to mca_pml_ob1_send_fin was
> >> wrong. I fixed this in r15304. But now we hang instead of segfault, and
> >> have both processes just looping through opal_progress. I really don't
> >> know what to look for. Any hints?
> >>
> > Can you look in gdb at mca_pml_ob1.rdma_pending?
> Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.

Do you run both ranks on the same node? Can you try to run them on
different nodes?
--
Gleb.