Hi,
On 11/19/2011 05:18 AM, Davis, Arlin R wrote:
#0 dapl_llist_remove_entry (head=0x636960, entry=0x7ffff0004bf8) at
[...]
You should have seen a message like "WARNING: overflow event on EVD".
It appears that the default dapltest server allocates a CR EVD that is too
small for many client test configurations. When it hits the overflow case, the
CR callback incorrectly frees the CR before it is removed from the SP list. In
your case, I am guessing that another CR came in on another thread and this
memory was reallocated, with the flink pointer reinitialized.
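The ordering problem described above can be illustrated with a minimal sketch. This is not DAPL code: struct sp, struct cr, and the helpers cr_alloc, sp_link_cr, sp_remove_cr, and sp_drop_cr are hypothetical stand-ins, and locking is omitted for brevity. The point is only that an entry has to be unlinked from its list before its memory is released, otherwise the list keeps pointers into freed (and possibly reallocated) memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the CR and SP objects; not the real DAPL types. */
struct cr {
    struct cr *flink;   /* forward link to the next pending CR */
    struct cr *blink;   /* backward link to the previous pending CR */
    int id;
};

struct sp {
    struct cr *head;    /* first pending CR, or NULL when the list is empty */
    int count;          /* number of CRs currently linked */
};

/* Allocate a zeroed CR; returns NULL on allocation failure. */
static struct cr *cr_alloc(int id)
{
    struct cr *c = calloc(1, sizeof(*c));
    if (c != NULL)
        c->id = id;
    return c;
}

/* Link a CR at the head of the SP's pending list. */
static void sp_link_cr(struct sp *sp, struct cr *c)
{
    c->blink = NULL;
    c->flink = sp->head;
    if (sp->head != NULL)
        sp->head->blink = c;
    sp->head = c;
    sp->count++;
}

/* Unlink a CR from the SP's pending list without freeing it. */
static void sp_remove_cr(struct sp *sp, struct cr *c)
{
    if (c->blink != NULL) {
        c->blink->flink = c->flink;
    } else {
        assert(sp->head == c);   /* no predecessor means c must be the head */
        sp->head = c->flink;
    }
    if (c->flink != NULL)
        c->flink->blink = c->blink;
    c->flink = NULL;
    c->blink = NULL;
    sp->count--;
}

/*
 * Error path in the safe order: unlink first, then free.
 * Calling free(c) before sp_remove_cr() would make the unlink step read
 * c->flink and c->blink from released memory.
 * Returns the number of CRs still linked.
 */
static int sp_drop_cr(struct sp *sp, struct cr *c)
{
    sp_remove_cr(sp, c);
    free(c);
    return sp->count;
}
```

In a multi-threaded server the unlink step would also need to run under the list's lock, so the free is best placed after the unlock.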
Please try the following patches.
---------
Common: CR EVD overflow causes segfault.
The CR is freed up incorrectly before unlinking with SP.
Signed-off-by: Arlin Davis<[email protected]>
diff --git a/dapl/common/dapl_cr_callback.c b/dapl/common/dapl_cr_callback.c
index 3997b38..c58444b 100644
--- a/dapl/common/dapl_cr_callback.c
+++ b/dapl/common/dapl_cr_callback.c
@@ -414,7 +414,6 @@ dapli_connection_request(IN dp_ib_cm_handle_t ib_cm_handle,
(DAT_CR_HANDLE) cr_ptr);
if (dat_status != DAT_SUCCESS) {
- dapls_cr_free(cr_ptr);
(void)dapls_ib_reject_connection(ib_cm_handle,
DAT_CONNECTION_EVENT_BROKEN,
0, NULL);
@@ -423,6 +422,7 @@ dapli_connection_request(IN dp_ib_cm_handle_t ib_cm_handle,
dapl_os_lock(&sp_ptr->header.lock);
dapl_sp_remove_cr(sp_ptr, cr_ptr);
dapl_os_unlock(&sp_ptr->header.lock);
+ dapls_cr_free(cr_ptr);
return DAT_INSUFFICIENT_RESOURCES;
}
----------
dapltest: server CR EVD is too small for multi-client configurations.
Increase default size from 8 to 32.
Signed-off-by: Arlin Davis<[email protected]>
diff --git a/test/dapltest/test/dapl_server.c b/test/dapltest/test/dapl_server.c
index 443425c..92e0d21 100644
--- a/test/dapltest/test/dapl_server.c
+++ b/test/dapltest/test/dapl_server.c
@@ -34,7 +34,7 @@
#undef DFLT_QLEN
#endif
-#define DFLT_QLEN 8 /* default event queue length */
+#define DFLT_QLEN 32 /* default event queue length */
int send_control_data(DT_Tdep_Print_Head * phead,
unsigned char *buffp,
Thank you for the two patches. I applied both, and so far I have not
seen a segfault, at least on dapl-server.
However, after about 2 hours of testing, some of the dapl-client
instances print the following error on the console:
----
Server Name: 3.4.5.1
Server Net Address: 3.4.5.1
DT_cs_Client: Starting Test ...
FAIL: 16 Server test connections did not report ready.
FAIL: 16 Server test connections did not report ready.
----
dapl-client stalls at this stage and has to be killed manually with
Ctrl+C.
The following errors are seen on the dapl-server console:
----
Test Error: Client_Mem_Info_Send-reaping DTO problem, status = FAILURE
Test Error: Client_Mem_Info_Send-reaping DTO problem, status = FAILURE
Test[b368]: Warning: dat_ep_disconnect (abrupt) #2 error
DAT_INVALID_STATE DAT_INVALID_STATE_EP_UNCONNECTED
Test[b368]: dat_evd_free (creq) error: DAT_INVALID_STATE
DAT_INVALID_STATE_EVD_IN_USE
Test[b368]: Warning: dat_ep_disconnect (abrupt) #3 error
DAT_INVALID_STATE DAT_INVALID_STATE_EP_UNCONNECTED
Test[b368]: dat_evd_free (creq) error: DAT_INVALID_STATE
DAT_INVALID_STATE_EVD_IN_USE
...
----
No messages appear in dmesg on either the dapl-server or the dapl-client
machine.
If I manually kill the dapl-client and restart it, the test starts fine
again and runs for about 2 hours or so.
Thanks,
Kumar.