Backing out r26597 solves my particular test cases. I'll back it out of
the trunk as well unless someone has objections.
I like how you say "same segfault." In certain cases, I just go on to
different segfaults. E.g.,
[2] btl_openib_handle_incoming(openib_btl, ep, frag, byte_len = 20U),
line 3208 in "btl_openib_component.c"
[3] handle_wc(device, cq = 0, wc), line 3516 in "btl_openib_component.c"
[4] poll_device(device, count = 1), line 3654 in "btl_openib_component.c"
[5] progress_one_device(device), line 3762 in "btl_openib_component.c"
[6] btl_openib_component_progress(), line 3787 in
"btl_openib_component.c"
[7] opal_progress(), line 207 in "opal_progress.c"
[8] opal_condition_wait(c, m), line 100 in "condition.h"
[9] ompi_request_default_wait_all(count = 2U, requests, statuses),
line 281 in "req_wait.c"
[10] ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0,
sdatatype, dest = 0, stag = -16, recvbuf = (nil), rcount = 0, rdatatype,
source = 0, rtag = -16, comm, status = (nil)), line 54 in
"coll_tuned_util.c"
[11] ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module),
line 172 in "coll_tuned_barrier.c"
[12] ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207
in "coll_tuned_decision_fixed.c"
[13] PMPI_Barrier(comm = 0x518370), line 62 in "pbarrier.c"
The reg->cbfunc is NULL. I'm still considering whether that's an
artifact of how I build that particular case, though.
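
To illustrate the failure mode in the trace above, here is a minimal sketch of a tag-indexed callback table whose dispatch path checks for an unregistered (NULL) callback before calling through it. Every name here (toy_reg_t, toy_registry, toy_register, toy_dispatch, toy_count_cb, TOY_TAG_MAX) is hypothetical and invented for illustration. This is not the openib BTL code, and it makes no claim about how btl_openib_handle_incoming actually looks up reg->cbfunc.

```c
#include <stddef.h>

/* Hypothetical callback signature and registration record (illustration only). */
typedef void (*toy_cbfunc_t)(int tag, void *cbdata);

typedef struct {
    toy_cbfunc_t cbfunc;  /* NULL until something registers for this tag */
    void *cbdata;
} toy_reg_t;

#define TOY_TAG_MAX 8

/* Static storage is zero-initialized, so every cbfunc starts out NULL. */
static toy_reg_t toy_registry[TOY_TAG_MAX];

/* Sample callback: counts invocations through the int that cbdata points to. */
static void toy_count_cb(int tag, void *cbdata)
{
    (void)tag;
    ++*(int *)cbdata;
}

/* Register a callback for a tag; returns 0 on success, -1 on bad arguments. */
static int toy_register(int tag, toy_cbfunc_t cbfunc, void *cbdata)
{
    if (tag < 0 || tag >= TOY_TAG_MAX || cbfunc == NULL) {
        return -1;
    }
    toy_registry[tag].cbfunc = cbfunc;
    toy_registry[tag].cbdata = cbdata;
    return 0;
}

/* Dispatch an incoming fragment by tag. Calling through a NULL cbfunc would
 * crash, so an unregistered tag is reported as an error instead. */
static int toy_dispatch(int tag)
{
    const toy_reg_t *reg;

    if (tag < 0 || tag >= TOY_TAG_MAX) {
        return -1;
    }
    reg = &toy_registry[tag];
    if (reg->cbfunc == NULL) {
        return -1;  /* nothing registered for this tag */
    }
    reg->cbfunc(tag, reg->cbdata);
    return 0;
}
```

In this toy model, dispatching a tag before anything has registered for it returns -1 rather than jumping through a NULL pointer. Whether such a check belongs in the real receive path, or whether the NULL entry is only a symptom of a build or initialization problem, is the open question.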
On 06/15/12 09:44, George Bosilca wrote:
There should be no datatype attached to the barrier, so it is normal you get
the zero values in the convertor.
Something weird is definitely going on. As there is no data to be sent, the
opal_convertor_set_position function is supposed to trigger the special path,
mark the convertor as completed and return successfully. However, this seems
not to be the case anymore as in your backtrace I see the call to
opal_convertor_set_position_nocheck, which only happens if the above described
test fails.
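
The expected behaviour described above can be sketched with a toy model. The types and functions below (toy_desc_t, toy_convertor_t, toy_set_position, toy_set_position_nocheck, TOY_CONVERTOR_COMPLETED) are hypothetical stand-ins written for this illustration. They are not the real opal_convertor definitions, and the real test may differ in detail. The sketch only assumes what is stated above: a zero-length message should take a shortcut that marks the convertor completed, and the nocheck path is reached only when that test fails.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag and types: NOT the real opal_convertor_t layout. */
#define TOY_CONVERTOR_COMPLETED 0x1u

typedef struct {
    size_t length;
    size_t used;
    const void *desc;     /* NULL when no datatype description is attached */
} toy_desc_t;

typedef struct {
    size_t local_size;    /* total bytes to pack; 0 for a barrier message */
    size_t bytes_done;    /* bytes already processed */
    uint32_t flags;
    const toy_desc_t *use_desc;
} toy_convertor_t;

/* Slow path: a real implementation would walk the datatype description here.
 * The toy version returns -1 instead of dereferencing a NULL description. */
static int toy_set_position_nocheck(toy_convertor_t *conv, const size_t *position)
{
    if (conv->use_desc == NULL || conv->use_desc->desc == NULL) {
        return -1;
    }
    conv->bytes_done = *position;
    return 0;
}

/* Entry point: the special path handles "nothing left to send" without
 * touching the description at all. */
static int toy_set_position(toy_convertor_t *conv, size_t *position)
{
    if (*position >= conv->local_size) {
        /* Covers the zero-length case: mark completed, return success. */
        conv->flags |= TOY_CONVERTOR_COMPLETED;
        conv->bytes_done = conv->local_size;
        *position = conv->bytes_done;
        return 0;
    }
    if (*position == conv->bytes_done) {
        return 0;  /* already at the requested position */
    }
    conv->flags &= ~TOY_CONVERTOR_COMPLETED;
    return toy_set_position_nocheck(conv, position);
}
```

In this model, a convertor with local_size == 0 never reaches the nocheck function, whatever its description pointer holds. Reaching it with a NULL description implies that the size seen by the entry test was nonzero (or otherwise inconsistent) at the time of the call.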
I had some doubts about r26597, but I don't have time to look into it until
Monday. Maybe you can remove it and see if you continue to have the same
segfault.
george.
On Jun 15, 2012, at 01:24 , Eugene Loh wrote:
I see a segfault show up in trunk testing starting with r26598 when tests like
ibm collective/struct_gatherv
intel src/MPI_Type_free_[types|pending_msg]_[f|c]
are run over openib. Here is a typical stack trace:
opal_convertor_create_stack_at_begining(convertor = 0x689730, sizes), line 404 in
"opal_convertor.c"
opal_convertor_set_position_nocheck(convertor = 0x689730, position), line 423 in
"opal_convertor.c"
opal_convertor_set_position(convertor = 0x689730, position = 0x7fffc36e0bf0), line 321
in "opal_convertor.h"
mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, size = 0), line 485
in "pml_ob1_sendreq.c"
mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in
"pml_ob1_sendreq.h"
mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in
"pml_ob1_sendreq.h"
mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = -16, sendmode =
MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in "pml_ob1_isend.c"
ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, sdatatype, dest = 2, stag
= -16, recvbuf = (nil), rcount = 0, rdatatype, source = 2, rtag = -16, comm, status =
(nil)), line 51 in "coll_tuned_util.c"
ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 172 in
"coll_tuned_barrier.c"
ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in
"coll_tuned_decision_fixed.c"
PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219
The fact that some derived data types were sent before seems to have something
to do with it. I see this sort of problem cropping up in Cisco and Oracle
testing. Up at the level of pml_ob1_send_request_start_copy, at line 485:
MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);
I see
*sendreq->req_send.req_base.req_convertor.use_desc = {
length = 0
used = 0
desc = (nil)
}
and I guess that desc=NULL is causing the segfault at opal_convertor.c line 404.
Anyhow, I'm trudging along, but thought I would share at least that much with
you helpful folks in case any of this is ringing a bell.