There should be no datatype attached to the barrier, so it is normal that you 
get zero values in the convertor.

Something weird is definitely going on. As there is no data to be sent, the 
opal_convertor_set_position function is supposed to take the special path, 
mark the convertor as completed and return successfully. However, this no 
longer seems to be the case: in your backtrace I see the call to 
opal_convertor_set_position_nocheck, which only happens if the test described 
above fails.
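
To make the expected control flow concrete, here is a minimal sketch in C. It 
is a toy model, not the actual Open MPI source: every toy_/TOY_ name, the 
struct layout and the return codes are assumptions made for illustration. It 
only shows the idea that a zero-length request should complete on the fast 
path and never reach the nocheck routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from Open MPI. */
#define TOY_SUCCESS    0
#define TOY_ERROR     (-1)
#define TOY_COMPLETED  0x1u

typedef struct {
    size_t      local_size;  /* total number of bytes the convertor describes */
    size_t      bConverted;  /* number of bytes already converted */
    uint32_t    flags;       /* TOY_COMPLETED once there is nothing left to do */
    const void *desc;        /* datatype description; NULL when no datatype */
} toy_convertor_t;

/* Slow path: needs a valid datatype description to rebuild its state. */
int toy_set_position_nocheck(toy_convertor_t *c, size_t *position)
{
    /* Only reachable when the requested position is inside the data. */
    assert(*position < c->local_size);
    if (c->desc == NULL) {
        /* The toy reports an error here; code that dereferenced desc
         * without this check would crash instead. */
        return TOY_ERROR;
    }
    c->bConverted = *position;
    return TOY_SUCCESS;
}

/* Checked entry point: handles "nothing to send" before touching desc. */
int toy_set_position(toy_convertor_t *c, size_t *position)
{
    if (c->local_size <= *position) {
        /* Special path: zero-size data, or a position at/after the end. */
        c->flags |= TOY_COMPLETED;
        c->bConverted = c->local_size;
        *position = c->bConverted;
        return TOY_SUCCESS;
    }
    c->flags &= ~TOY_COMPLETED;
    return toy_set_position_nocheck(c, position);
}
```

In this model, a convertor with local_size == 0 and a position of 0 returns 
from toy_set_position on the special path without ever looking at desc, so a 
NULL desc is harmless. Reaching the nocheck routine with a NULL desc means 
the first test did not behave as intended.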

I had some doubts about r26597, but I don't have time to look into it until 
Monday. Maybe you can revert it and see whether you still get the same 
segfault.

  george.

On Jun 15, 2012, at 01:24 , Eugene Loh wrote:

> I see a segfault show up in trunk testing starting with r26598 when tests like
> 
>    ibm  collective/struct_gatherv
>    intel src/MPI_Type_free_[types|pending_msg]_[f|c]
> 
> are run over openib.  Here is a typical stack trace:
> 
>   opal_convertor_create_stack_at_begining(convertor = 0x689730, sizes), line 
> 404 in "opal_convertor.c"
>   opal_convertor_set_position_nocheck(convertor = 0x689730, position), line 
> 423 in "opal_convertor.c"
>   opal_convertor_set_position(convertor = 0x689730, position = 
> 0x7fffc36e0bf0), line 321 in "opal_convertor.h"
>   mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, size = 0), 
> line 485 in "pml_ob1_sendreq.c"
>   mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in 
> "pml_ob1_sendreq.h"
>   mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in 
> "pml_ob1_sendreq.h"
>   mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = -16, 
> sendmode = MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in 
> "pml_ob1_isend.c"
>   ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, sdatatype, 
> dest = 2, stag = -16, recvbuf = (nil), rcount = 0, rdatatype, source = 2, 
> rtag = -16, comm, status = (nil)), line 51 in "coll_tuned_util.c"
>   ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 172 in 
> "coll_tuned_barrier.c"
>   ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in 
> "coll_tuned_decision_fixed.c"
>   PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
>   main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219
> 
> The fact that some derived data types were sent before seems to have 
> something to do with it.  I see this sort of problem cropping up in Cisco and 
> Oracle testing.  Up at the level of pml_ob1_send_request_start_copy, at line 
> 485:
> 
>   MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);
> 
> I see
> 
>    *sendreq->req_send.req_base.req_convertor.use_desc = {
>        length = 0
>        used   = 0
>        desc   = (nil)
>    }
> 
> and I guess that desc=NULL is causing the segfault at opal_convertor.c line 
> 404.
> 
> Anyhow, I'm trudging along, but thought I would share at least that much with 
> you helpful folks in case any of this is ringing a bell.
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

