Hi all,

It was a bug in my BMI_mx_mem[alloc|free]() code. I was taking a lock, and in one rare case, I did not release it. The next thread to try to take it deadlocked the app.
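
For the archive, the shape of the bug (a minimal sketch using plain pthreads, not the actual BMI code):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;

    void *memalloc_sketch(size_t len)
    {
        void *buf;

        pthread_mutex_lock(&mem_lock);
        buf = malloc(len);
        if (buf == NULL)
            return NULL;   /* BUG: returns with mem_lock still held;
                            * the next caller to lock it deadlocks */

        /* ... update shared bookkeeping here ... */

        pthread_mutex_unlock(&mem_lock);
        return buf;
    }

The fix was just a pthread_mutex_unlock(&mem_lock) before that early return.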

The simultaneous copies now complete and the server exits cleanly.

Sorry for the distraction.

Scott

On Jan 11, 2007, at 2:09 PM, Scott Atchley wrote:

I do not know if this is related, but when I try to ^C the server, I do not see the normal shutdown messages, such as:

PVFS2 server got signal 2 (server_status_flag: 262143)
[D 01/11 13:40] *** server shutdown in progress ***
[D 01/11 13:40] [+] halting state machine processor   [   ...   ]
[D 01/11 13:40] [-]         state machine processor   [ stopped ]
<snip>

Instead, I only get:

PVFS2 server got signal 2 (server_status_flag: 262143)

and nothing else. I have waited more than 5 minutes before suspending the process with ^Z and killing it. Also, even before I use ^C, the server is unresponsive to new operations such as pvfs2-ls.

Scott

On Jan 11, 2007, at 1:48 PM, Scott Atchley wrote:

Well, I still can't use the core file, but the crash is not happening in my BMI_mx_cancel() function. I added print statements at several locations, including the end of the function, and all of them print.

[E 13:36:40.025329] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 71.
[D 13:36:40.025508] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 72, ptr: 0x810b924.
[D 13:36:40.025577] BMI_cancel: cancel id 72
* BMI_mx_cancel RX op_id 72 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}

[E 13:36:40.988876] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 77.
[D 13:36:40.988926] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 78, ptr: 0x810b924.
[D 13:36:40.988945] BMI_cancel: cancel id 78
* BMI_mx_cancel RX op_id 78 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[2] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}

I noticed that bmi_ib does not actually cancel an operation; it simply closes the connection. MX can cancel receives, so I only close the connection if I need to cancel a send. I do not know if this is relevant.
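
Roughly, the cancel policy looks like this (a sketch only; mx_cancel() is the real MX call, but the context struct and helper are hypothetical, not the actual bmi_mx source):

    #include <stdint.h>
    #include "myriexpress.h"   /* MX API: mx_cancel() */

    extern void close_connection(void *peer);   /* hypothetical helper */

    struct ctx {               /* hypothetical stand-in for the bmi_mx context */
        int           is_recv;
        mx_endpoint_t endpoint;
        mx_request_t  request;
        void         *peer;
    };

    static void cancel_sketch(struct ctx *c)
    {
        uint32_t result = 0;

        if (c->is_recv) {
            /* MX can cancel a posted receive in place */
            mx_cancel(c->endpoint, &c->request, &result);
        } else {
            /* an in-flight send cannot be cancelled; the only way to
             * stop it is to tear down the connection, which is what
             * bmi_ib does for every cancel */
            close_connection(c->peer);
        }
    }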

Scott


On Jan 11, 2007, at 11:58 AM, Scott Atchley wrote:

Hi Sam,

I am using a 256 MB file on a machine with only 1 GB of memory. The server sees the timeouts after ~60 seconds; the client takes much longer (perhaps 5 minutes). I will time it on my next run.

Scott

On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:


Hi Scott,

How big is the test file you're copying? tcp doesn't hang with two pvfs2-cp processes on a 40 MB file, but I should probably try something larger. :-) The flow timeouts on the server are set to 5 minutes. Are you waiting that long before seeing those messages in the log?

-sam

On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:

Hi all,

Here is a little more detail. On the server, after the stall I only see:

[E 01/11 11:28] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 362.
[E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called on 0x8197828
[D 01/11 11:28] fp_multiqueue_cancel: called on already completed flow; doing nothing.

There are no timeouts in BMI.

On the client, I eventually see:

[E 11:32:21.004481] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 41.
[D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 42, ptr: 0x810c5ac.
[D 11:32:21.004719] BMI_cancel: cancel id 42
* BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */

[2] - segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}

[E 11:32:21.993383] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 53.
[D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 54, ptr: 0x810c5ac.
[D 11:32:21.993439] BMI_cancel: cancel id 54
* BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */

[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}

Since the receives are pending (posted, not completed), the cancel should succeed. It is probably a bug in my code that causes the segfaults (a guess at the kind of bug is sketched below). Unfortunately, the core files are not usable:

% gdb pvfs2-cp core.28788
"core.28788" is not a core dump: File format not recognized

The file utility, however, disagrees:

% file core.28788
core.28788: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
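
One guess at the kind of bug to look for (a fragment only, assuming an MX endpoint and a posted request in scope; mx_cancel() and MX_SUCCESS are the real MX API): mx_cancel() reports through its result argument whether the request was actually cancelled. If the request had already matched, its completion is still delivered, so freeing the context at cancel time would be a use-after-free:

    uint32_t result = 0;

    if (mx_cancel(endpoint, &request, &result) == MX_SUCCESS
        && result != 0) {
        /* the receive was truly cancelled; safe to reclaim its
         * context here */
    } else {
        /* the request already matched or completed; its completion
         * is still coming, so the context must not be freed here */
    }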

If the other BMI methods do not hang, then I need to keep digging.

Scott

On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:

Hi all,

I am simply running two pvfs2-cp processes at the same time to see how everything works. For some reason, the copies start but do not finish. Eventually, I see timeouts in BMI but not in bmi_mx. Before I spend too much time on this, can the other BMI methods run two copies at the same time?

$ for I in 1 2 ; do
pvfs2-cp test /mnt/pvfs2/test-${I} &
done

Thanks,

Scott

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
