Hi All,

I am trying to debug a segfault observed on dapltest-server with OFED-1.5.4.
I am using the daily-build OFED-1.5.4-20111116-0600 for this test.
The test setup involves 4 machines connected via a switch.
One machine acts as the dapltest server while the other 3 machines act as dapltest clients.

We are running several different kinds of RDMA read/write tests on dapl in a
continuous loop using a script. The test runs fine for around 2 hours, after
which the dapltest-server segfaults with the stack trace below:

-----------
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6e30710 (LWP 2397)]
dapl_llist_remove_entry (head=0x636960, entry=0x7ffff0004bf8) at dapl/common/dapl_llist.c:272
272     dapl/common/dapl_llist.c: No such file or directory.
        in dapl/common/dapl_llist.c
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.7.el6.x86_64 libgcc-4.4.4-13.el6.x86_64
(gdb) bt
#0  dapl_llist_remove_entry (head=0x636960, entry=0x7ffff0004bf8) at dapl/common/dapl_llist.c:272
#1  0x00007ffff799fb09 in dapl_sp_remove_cr (sp_ptr=0x6368c0, cr_ptr=0x7ffff0004be0) at dapl/common/dapl_sp_util.c:229
#2  0x00007ffff7998148 in dapli_connection_request (ib_cm_handle=<value optimized out>, sp_ptr=0x6368c0, prd_ptr=<value optimized out>,
    private_data_size=<value optimized out>, evd_ptr=0x633fb0) at dapl/common/dapl_cr_callback.c:424
#3  0x00007ffff799851e in dapls_cr_callback (ib_cm_handle=0x7ffff0004880, ib_cm_event=IB_CME_CONNECTION_REQUEST_PENDING,
    private_data_ptr=0x0, private_data_size=0, context=0x6368c0) at dapl/common/dapl_cr_callback.c:178
#4  0x00007ffff79a4c33 in dapli_cm_passive_cb () at dapl/openib_cma/cm.c:524
#5  dapli_cma_event_cb () at dapl/openib_cma/cm.c:1207
#6  0x00007ffff79a6657 in dapli_thread (arg=<value optimized out>) at dapl/openib_cma/device.c:692
#7  0x00007ffff79971d1 in dapli_thread_init (thread_draft=0x630320) at dapl/udapl/linux/dapl_osd.c:590
#8  0x0000003b156077e1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003b14ee153d in clone () from /lib64/libc.so.6
(gdb) p
The history is empty.
(gdb) info args
head = 0x636960
entry = 0x7ffff0004bf8
(gdb) p *head
$1 = (DAPL_LLIST_HEAD) 0x7ffff00107d8
(gdb) p *entry
$2 = {flink = 0x0, blink = 0x7ffff0003cf8, data = 0x7ffff0004be0, list_head = 0x0}
(gdb) info thread
  950 Thread 0x7ffff7fef710 (LWP 3924)  0x0000003b1560b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  949 Thread 0x7ffff5f76710 (LWP 3923)  0x0000003b1560eced in nanosleep () from /lib64/libpthread.so.0
* 2 Thread 0x7ffff6e30710 (LWP 2397)  dapl_llist_remove_entry (head=0x636960, entry=0x7ffff0004bf8) at dapl/common/dapl_llist.c:272
  1 Thread 0x7ffff7bb0700 (LWP 2394)  0x0000003b14ed7e33 in poll () from /lib64/libc.so.6
(gdb) p head
$3 = (DAPL_LLIST_HEAD *) 0x636960
(gdb) p entry
$4 = (DAPL_LLIST_ENTRY *) 0x7ffff0004bf8
(gdb) p *entry
$5 = {flink = 0x0, blink = 0x7ffff0003cf8, data = 0x7ffff0004be0, list_head = 0x0}
(gdb) p *head
$6 = (DAPL_LLIST_HEAD) 0x7ffff00107d8
(gdb) 
-----------


The problematic line in the dapl source code is:
-------------
File dapl/common/dapl_llist.c, function dapl_llist_remove_entry:
....
        dapl_os_assert(entry->list_head == head);
        entry->list_head = NULL;

        entry->flink->blink = entry->blink; <===== Problem line. flink is NULL
        entry->blink->flink = entry->flink;
....
--------------
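
To make the failure mode concrete, here is a minimal, self-contained sketch of
a circular doubly linked list with the same field names as in the gdb output
above (flink, blink, data, list_head). It is NOT the DAPL code: the sketch_*
types and functions are hypothetical stand-ins, and clearing flink/blink on
removal is an assumption of this sketch, not something confirmed from the dapl
source. It only shows that the unlink step dereferences entry->flink, so an
entry whose flink is already NULL (for example, one that is not linked on the
list any more) faults at exactly that line, and that a checked variant could
report the condition instead of crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for DAPL_LLIST_ENTRY / DAPL_LLIST_HEAD. */
typedef struct sketch_entry {
    struct sketch_entry *flink;       /* next entry (circular) */
    struct sketch_entry *blink;       /* previous entry (circular) */
    void *data;                       /* payload pointer */
    struct sketch_entry **list_head;  /* list this entry is linked on, or NULL */
} sketch_entry;

typedef sketch_entry *sketch_head;    /* head points at the first entry, or NULL */

/* Mark an entry as not linked on any list. */
static void sketch_init_entry(sketch_entry *entry)
{
    entry->flink = NULL;
    entry->blink = NULL;
    entry->data = NULL;
    entry->list_head = NULL;
}

/* Append an entry at the tail of the circular list. */
static void sketch_add_tail(sketch_head *head, sketch_entry *entry, void *data)
{
    assert(entry->list_head == NULL);  /* must not already be linked */
    entry->data = data;
    entry->list_head = head;
    if (*head == NULL) {
        /* first entry: it is its own neighbour in both directions */
        entry->flink = entry;
        entry->blink = entry;
        *head = entry;
    } else {
        sketch_entry *first = *head;
        entry->flink = first;
        entry->blink = first->blink;
        first->blink->flink = entry;
        first->blink = entry;
    }
}

/*
 * Remove an entry, but refuse to touch one that is not linked on this list.
 * Returns 0 on success, -1 if the entry is not on the list (the case that
 * would otherwise dereference a NULL flink).
 */
static int sketch_remove_checked(sketch_head *head, sketch_entry *entry)
{
    if (entry->list_head != head || entry->flink == NULL || entry->blink == NULL)
        return -1;

    if (*head == entry)
        *head = (entry->flink == entry) ? NULL : entry->flink;

    entry->list_head = NULL;
    entry->flink->blink = entry->blink;  /* the step that faults if flink is NULL */
    entry->blink->flink = entry->flink;

    /* assumption of this sketch: links are cleared after removal */
    entry->flink = NULL;
    entry->blink = NULL;
    return 0;
}
```

With this sketch, calling sketch_remove_checked() a second time on the same
entry returns -1 instead of faulting. A temporary check of this kind in a
debug build might help show which caller passes in the unlinked entry.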

It seems that some time back a new release of dapl (dapl-2.0.34-1.src.rpm) was
introduced in OFED-1.5.4, so I am wondering whether this is a regression in
the new release of dapl.
If anyone is aware of this issue, or knows what could lead to this
dapltest-server segfault, it would be helpful if you could shed some light.


Thanks,
Kumar.