Re: [Pvfs2-developers] Does this work for IB and TCP?

Scott Atchley Thu, 11 Jan 2007 11:11:34 -0800

I do not know if this is related or not, but when I try to ^C the server, I do not see the normal shutdown messages such as:


PVFS2 server got signal 2 (server_status_flag: 262143)
[D 01/11 13:40] *** server shutdown in progress ***
[D 01/11 13:40] [+] halting state machine processor   [   ...   ]
[D 01/11 13:40] [-]         state machine processor   [ stopped ]
<snip>


Instead, I only get:

PVFS2 server got signal 2 (server_status_flag: 262143)

and nothing else. I have waited more than 5 minutes before ^Z the process and killing it. Also, before using ^C, it is unresponsive to new operations such as pvfs2-ls.


Scott

On Jan 11, 2007, at 1:48 PM, Scott Atchley wrote:

Well, I still can't use the core file but it is not happening in my BMI_mx_cancel() function. I added print statements at several locations including the end of the function and all print.
[E 13:36:40.025329] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 71.
[D 13:36:40.025508] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 72, ptr: 0x810b924.
[D 13:36:40.025577] BMI_cancel: cancel id 72
* BMI_mx_cancel RX op_id 72 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
[E 13:36:40.988876] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 77.
[D 13:36:40.988926] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 78, ptr: 0x810b924.
[D 13:36:40.988945] BMI_cancel: cancel id 78
* BMI_mx_cancel RX op_id 78 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel  mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[2] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
I noticed that bmi_ib does not actually cancel an operation, it simply closes the connection. MX can cancel receives, and I only close the connection if I need to cancel a send. I do not know if this is relevant or not.
Scott


On Jan 11, 2007, at 11:58 AM, Scott Atchley wrote:
Hi Sam,
I am using a 256 MB file on a machine with only 1 GB of memory. The server sees the timeouts after ~60 seconds and the client is much longer (and may be 5 minutes). I will time it on my next run.
Scott

On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:
Hi Scott,
How big is the test file you're copying? tcp doesn't hang with two pvfs2-cp on a 40MB, but I should probably try something larger. :-) The flow timeouts on the server are set to 5 minutes. Are you waiting that long before seeing those messages in the log?
-sam

On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:
Hi all,
Here is a little more detail. On the server, after the stall I only see:
[E 01/11 11:28] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 362.
[E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called on 0x8197828
[D 01/11 11:28] fp_multiqueue_cancel: called on already completed flow; doing nothing.
There are no timeouts in BMI.

On the client, I eventually see:
[E 11:32:21.004481] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 41.
[D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 42, ptr: 0x810c5ac.
[D 11:32:21.004719] BMI_cancel: cancel id 42
* BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */
[2] - segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
[E 11:32:21.993383] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 53.
[D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 54, ptr: 0x810c5ac.
[D 11:32:21.993439] BMI_cancel: cancel id 54
* BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */
[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
Since the receives are pending (posted, not completed), the cancel should succeed. It is probably a bug in my code that causes the segfaults. Unfortunately, the core files are not usable:
% gdb pvfs2-cp core.28788
"core.28788" is not a core dump: File format not recognized

File disagrees:

% file core.28788
core.28788: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
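
When gdb and file disagree like this, the ELF header can be read directly as a third opinion. The following is only a sketch, assuming a POSIX od is available; is_elf_core is a hypothetical helper name, not part of PVFS2 or gdb. It checks the ELF magic, the byte-order field (EI_DATA), and whether e_type is ET_CORE (4). It does not explain why gdb rejects the file; it only confirms what the header itself says.

```shell
# Hypothetical helper: succeed if the file's ELF header marks it as a core file.
is_elf_core() {
    f=$1
    # Bytes 0-3: ELF magic (0x7f 'E' 'L' 'F').
    magic=$(od -An -tx1 -N4 "$f" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || return 1
    # Byte 5 (EI_DATA): 01 = little-endian, 02 = big-endian.
    data=$(od -An -tx1 -j5 -N1 "$f" | tr -d ' \n')
    # Bytes 16-17: e_type, stored in the file's byte order; ET_CORE is 4.
    etype=$(od -An -tx1 -j16 -N2 "$f" | tr -d ' \n')
    case "$data:$etype" in
        01:0400|02:0004) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage would look like: is_elf_core core.28788 && echo "header says core file"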
If the other BMI methods do not hang, then I need to keep digging.

Scott

On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:
Hi all,
I am simply running two pvfs2-cp processes at the same time to see how everything works. For some reason, the copies start but do not finish. Eventually, I see timeouts in BMI but not in bmi_mx. Before I spend too much time on this, can other methods run two copies at the same time?
$ for I in 1 2 ; do
pvfs2-cp test /mnt/pvfs2/test-${I} &
done
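
For anyone repeating this test with another BMI method, the loop above can be wrapped so that each copy's exit status is collected instead of relying on the shell's job notifications. This is a minimal sketch, not the script used in this thread: run_parallel_copies and COPY_CMD are hypothetical names introduced here, and the copy command defaults to pvfs2-cp with the same source and destination arguments as the loop above.

```shell
# Hypothetical wrapper: start COUNT copies in the background, then wait
# for each one and report its exit status.
# COPY_CMD is an assumed override; it defaults to pvfs2-cp.
run_parallel_copies() {
    src=$1
    dest_prefix=$2
    count=${3:-2}
    pids=""
    i=1
    while [ "$i" -le "$count" ]; do
        ${COPY_CMD:-pvfs2-cp} "$src" "${dest_prefix}-${i}" &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        status=0
        wait "$pid" || status=$?
        if [ "$status" -eq 0 ]; then
            echo "pid $pid: ok"
        else
            echo "pid $pid: exit status $status"
            failed=$((failed + 1))
        fi
    done
    # The return value is the number of copies that failed.
    return "$failed"
}
```

Usage would look like: run_parallel_copies test /mnt/pvfs2/test 2
On most Linux shells, a copy killed by a segmentation fault is reported with exit status 139 (128 + signal 11). A copy that hangs will still block in wait, so this does not replace a timeout.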

Thanks,

Scott
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers