Hi all,

It was a bug in my BMI_mx_mem[alloc|free]() code. I was taking a lock, and in one rare case, I did not release it. The next thread to try to take it deadlocked the app.
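
For the archive, the shape of the bug (a minimal sketch using plain pthreads, not the actual BMI code):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;

    void *memalloc_sketch(size_t len)
    {
        void *buf;

        pthread_mutex_lock(&mem_lock);
        buf = malloc(len);
        if (buf == NULL)
            return NULL;   /* BUG: returns with mem_lock still held;
                            * the next caller to lock it deadlocks */

        /* ... update shared bookkeeping here ... */

        pthread_mutex_unlock(&mem_lock);
        return buf;
    }

The fix was just a pthread_mutex_unlock(&mem_lock) before that early return.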

The simultaneous copies now complete and the server exits cleanly.

Sorry for the distraction.

Scott

On Jan 11, 2007, at 2:09 PM, Scott Atchley wrote:

I do not know if this is related, but when I try to ^C the server, I do not see the normal shutdown messages, such as:

PVFS2 server got signal 2 (server_status_flag: 262143)
[D 01/11 13:40] *** server shutdown in progress ***
[D 01/11 13:40] [+] halting state machine processor   [   ...   ]
[D 01/11 13:40] [-]         state machine processor   [ stopped ]
<snip>

Instead, I only get:

PVFS2 server got signal 2 (server_status_flag: 262143)

and nothing else. I have waited more than 5 minutes before suspending the process with ^Z and killing it. Also, even before I use ^C, the server is unresponsive to new operations such as pvfs2-ls.

Scott

On Jan 11, 2007, at 1:48 PM, Scott Atchley wrote:

Well, I still can't use the core file, but the crash is not happening in my BMI_mx_cancel() function. I added print statements at several locations, including the end of the function, and all of them print.

[E 13:36:40.025329] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 71.
[D 13:36:40.025508] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 72, ptr: 0x810b924.
[D 13:36:40.025577] BMI_cancel: cancel id 72
* BMI_mx_cancel RX op_id 72 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}

[E 13:36:40.988876] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 77.
[D 13:36:40.988926] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 78, ptr: 0x810b924.
[D 13:36:40.988945] BMI_cancel: cancel id 78
* BMI_mx_cancel RX op_id 78 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[2] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}

I noticed that bmi_ib does not actually cancel an operation; it simply closes the connection. MX can cancel receives, so I only close the connection if I need to cancel a send. I do not know if this is relevant.
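
Roughly, the cancel policy looks like this (a sketch only; mx_cancel() is the real MX call, but the context struct and helper are hypothetical, not the actual bmi_mx source):

    #include <stdint.h>
    #include "myriexpress.h"   /* MX API: mx_cancel() */

    extern void close_connection(void *peer);   /* hypothetical helper */

    struct ctx {               /* hypothetical stand-in for the bmi_mx context */
        int           is_recv;
        mx_endpoint_t endpoint;
        mx_request_t  request;
        void         *peer;
    };

    static void cancel_sketch(struct ctx *c)
    {
        uint32_t result = 0;

        if (c->is_recv) {
            /* MX can cancel a posted receive in place */
            mx_cancel(c->endpoint, &c->request, &result);
        } else {
            /* an in-flight send cannot be cancelled; the only way to
             * stop it is to tear down the connection, which is what
             * bmi_ib does for every cancel */
            close_connection(c->peer);
        }
    }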

Scott


On Jan 11, 2007, at 11:58 AM, Scott Atchley wrote:

Hi Sam,

I am using a 256 MB file on a machine with only 1 GB of memory. The server sees the timeouts after ~60 seconds; the client takes much longer (perhaps 5 minutes). I will time it on my next run.

Scott

On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:


Hi Scott,

How big is the test file you're copying? tcp doesn't hang with two pvfs2-cp processes on a 40 MB file, but I should probably try something larger. :-) The flow timeouts on the server are set to 5 minutes. Are you waiting that long before seeing those messages in the log?

-sam

On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:

Hi all,

Here is a little more detail. On the server, after the stall I only see:

[E 01/11 11:28] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 362.
[E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called on 0x8197828
[D 01/11 11:28] fp_multiqueue_cancel: called on already completed flow; doing nothing.

There are no timeouts in BMI.

On the client, I eventually see:

[E 11:32:21.004481] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 41.
[D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 42, ptr: 0x810c5ac.
[D 11:32:21.004719] BMI_cancel: cancel id 42
* BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */

[2] - segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}

[E 11:32:21.993383] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 53.
[D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 54, ptr: 0x810c5ac.
[D 11:32:21.993439] BMI_cancel: cancel id 54
* BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */

[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}

Since the receives are pending (posted, not completed), the cancel should succeed. It is probably a bug in my code that causes the segfaults (a guess at the kind of bug is sketched below). Unfortunately, the core files are not usable:

% gdb pvfs2-cp core.28788
"core.28788" is not a core dump: File format not recognized

The file utility, however, disagrees:

% file core.28788
core.28788: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
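
One guess at the kind of bug to look for (a fragment only, assuming an MX endpoint and a posted request in scope; mx_cancel() and MX_SUCCESS are the real MX API): mx_cancel() reports through its result argument whether the request was actually cancelled. If the request had already matched, its completion is still delivered, so freeing the context at cancel time would be a use-after-free:

    uint32_t result = 0;

    if (mx_cancel(endpoint, &request, &result) == MX_SUCCESS
        && result != 0) {
        /* the receive was truly cancelled; safe to reclaim its
         * context here */
    } else {
        /* the request already matched or completed; its completion
         * is still coming, so the context must not be freed here */
    }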

If the other BMI methods do not hang, then I need to keep digging.

Scott

On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:

Hi all,

I am simply running two pvfs2-cp processes at the same time to see how everything works. For some reason, the copies start but do not finish. Eventually, I see timeouts in BMI but not in bmi_mx. Before I spend too much time on this, can the other BMI methods run two copies at the same time?

$ for I in 1 2 ; do
pvfs2-cp test /mnt/pvfs2/test-${I} &
done

Thanks,

Scott

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
