On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> > On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > > While looking into another problem I ran into an issue which made ob1
> > > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > > test suite, if I limit the gm free list too much, I get a segfault.
> > > That is,
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> > >
> > > works fine, but
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >
> > I cannot, unfortunately, reproduce this with the openib BTL.
> >
> > > segfaults. Here is the relevant output from gdb:
> > >
> > > Program received signal SIGSEGV, Segmentation fault.
> > > [Switching to Thread 1077541088 (LWP 15600)]
> > > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > >     hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > > 267         MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order, sizeof(mca_pml_ob1_fin_hdr_t));
> >
> > can you send me what's inside bml_btl?
> 
> It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong. I 
> fixed this in r15304. But now we hang instead of segfault, and have both 
> processes just looping through opal_progress. I really don't know what to look
> for. Any hints?
> 
Can you look in gdb at mca_pml_ob1.rdma_pending?

--
                        Gleb.
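Gleb's question suggests one plausible way a hang (rather than a crash) can appear only when the free list is capped lower. If no descriptor is available, the operation is presumably parked on a pending list such as mca_pml_ob1.rdma_pending and only retried later from the progress loop. If that retry never happens, both sides keep spinning in opal_progress. The C sketch below is a hypothetical model of that pattern, not Open MPI code. pool_t, pending_t, pool_alloc, pool_free, start_op, progress_pending, POOL_MAX and QUEUE_MAX are all invented for illustration, and POOL_MAX only loosely stands in for a cap like btl_gm_free_list_max.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, NOT Open MPI code: a fixed-size descriptor pool
 * standing in for a BTL free list capped by something like
 * btl_gm_free_list_max. */
#define POOL_MAX 2
#define QUEUE_MAX 16

typedef struct {
    int in_use[POOL_MAX];          /* 1 if the descriptor is held by an op */
} pool_t;

typedef struct {
    int ops[QUEUE_MAX];            /* ops waiting for a free descriptor */
    size_t head, tail;             /* FIFO indices (head == tail: empty) */
} pending_t;

/* Return a free descriptor index, or -1 when the pool is exhausted. */
static int pool_alloc(pool_t *p) {
    for (int i = 0; i < POOL_MAX; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Release a descriptor, e.g. when the op that used it completes. */
static void pool_free(pool_t *p, int d) {
    assert(d >= 0 && d < POOL_MAX && p->in_use[d]);
    p->in_use[d] = 0;
}

static void pending_push(pending_t *q, int op) {
    assert(q->tail - q->head < QUEUE_MAX);   /* queue must not overflow */
    q->ops[q->tail++ % QUEUE_MAX] = op;
}

/* Try to start an op. If no descriptor is free, park the op on the
 * pending list instead of failing. Returns the descriptor, or -1 if queued. */
static int start_op(pool_t *p, pending_t *q, int op) {
    int d = pool_alloc(p);
    if (d < 0)
        pending_push(q, op);
    return d;
}

/* One progress pass: restart parked ops while descriptors are available.
 * Each restarted op keeps its descriptor until pool_free() is called.
 * If nothing calls this after descriptors are released, parked ops wait
 * forever, which shows up as a hang rather than a segfault. */
static size_t progress_pending(pool_t *p, pending_t *q) {
    size_t started = 0;
    while (q->head != q->tail) {
        int d = pool_alloc(p);
        if (d < 0)
            break;                 /* still exhausted; retry on a later pass */
        q->head++;                 /* op q->ops[...] would be issued with d */
        started++;
    }
    return started;
}
```

With POOL_MAX at 2, a third start_op call is queued. progress_pending restarts it only after pool_free releases a descriptor, so inspecting the pending list in gdb shows whether work is stuck there.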
