Hi all,
It was a bug in my BMI_mx_mem[alloc|free]() code. I was taking a
lock, and in one rare case I did not release it. The next thread to
try to take it deadlocked the app.
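In case it helps anyone spot the same mistake, the pattern was the
usual error-path-returns-while-holding-the-lock. Below is a minimal
sketch of that shape (hypothetical names, not the actual bmi_mx code):

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical allocator wrapper, only to illustrate the bug: one
 * early-return path skipped the unlock, so the next caller blocked
 * forever on pthread_mutex_lock(). */
void *mem_alloc(size_t len)
{
    void *buf;

    pthread_mutex_lock(&mem_lock);
    buf = malloc(len);
    if (buf == NULL) {
        pthread_mutex_unlock(&mem_lock); /* this unlock was the missing piece */
        return NULL;
    }
    /* ... bookkeeping protected by mem_lock ... */
    pthread_mutex_unlock(&mem_lock);
    return buf;
}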
The multiple copies now complete and the server can exit cleanly.
Sorry for the distraction.
Scott
On Jan 11, 2007, at 2:09 PM, Scott Atchley wrote:
I do not know if this is related or not, but when I try to ^C the
server, I do not see the normal shutdown messages such as:
PVFS2 server got signal 2 (server_status_flag: 262143)
[D 01/11 13:40] *** server shutdown in progress ***
[D 01/11 13:40] [+] halting state machine processor [ ... ]
[D 01/11 13:40] [-] state machine processor [ stopped ]
<snip>
Instead, I only get:
PVFS2 server got signal 2 (server_status_flag: 262143)
and nothing else. I have waited more than 5 minutes before ^Z'ing the
process and killing it. Also, before I use ^C, the server is
unresponsive to new operations such as pvfs2-ls.
Scott
On Jan 11, 2007, at 1:48 PM, Scott Atchley wrote:
Well, I still can't use the core file, but the crash is not happening
in my BMI_mx_cancel() function. I added print statements at several
locations, including the end of the function, and they all print.
[E 13:36:40.025329] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 71.
[D 13:36:40.025508] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 72, ptr: 0x810b924.
[D 13:36:40.025577] BMI_cancel: cancel id 72
* BMI_mx_cancel RX op_id 72 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
[E 13:36:40.988876] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 77.
[D 13:36:40.988926] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 78, ptr: 0x810b924.
[D 13:36:40.988945] BMI_cancel: cancel id 78
* BMI_mx_cancel RX op_id 78 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[2] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
I noticed that bmi_ib does not actually cancel an operation; it
simply closes the connection. MX can cancel receives, and I only
close the connection if I need to cancel a send. I do not know if
this is relevant or not.
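To make that concrete, the decision my cancel path makes is roughly
the sketch below. Everything in it (type and helper names, the
stubbed-out MX call) is illustrative only, not the actual bmi_mx code:

enum bmx_req_type { BMX_REQ_TX, BMX_REQ_RX };

struct bmx_op {
    enum bmx_req_type type;   /* send or receive            */
    int               peer;   /* handle for the remote peer */
};

/* Stand-ins for the real MX and bmi_mx calls. */
static int  try_mx_cancel(struct bmx_op *op)          { (void)op; return 1; }
static void move_to_canceled_queue(struct bmx_op *op) { (void)op; }
static void close_connection(int peer)                { (void)peer; }

static int cancel_policy(struct bmx_op *op)
{
    if (op->type == BMX_REQ_RX) {
        /* MX can cancel a posted-but-unmatched receive directly. */
        if (try_mx_cancel(op)) {
            move_to_canceled_queue(op);
            return 0;
        }
        return -1;   /* recv already matched; let it complete */
    }
    /* Sends cannot be cancelled once handed to MX, so the only way
     * to kill one is to close the connection to the peer. */
    close_connection(op->peer);
    return 0;
}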
Scott
On Jan 11, 2007, at 11:58 AM, Scott Atchley wrote:
Hi Sam,
I am using a 256 MB file on a machine with only 1 GB of memory.
The server sees the timeouts after ~60 seconds; the client takes
much longer (maybe 5 minutes). I will time it on my next run.
Scott
On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:
Hi Scott,
How big is the test file you're copying? tcp doesn't hang with
two pvfs2-cp processes on a 40 MB file, but I should probably try
something larger. :-) The flow timeouts on the server are set to 5
minutes. Are you waiting that long before seeing those messages
in the log?
-sam
On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:
Hi all,
Here is a little more detail. On the server, after the stall I
only see:
[E 01/11 11:28] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 362.
[E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called on 0x8197828
[D 01/11 11:28] fp_multiqueue_cancel: called on already completed flow; doing nothing.
There are no timeouts in BMI.
On the client, I eventually see:
[E 11:32:21.004481] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 41.
[D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 42, ptr: 0x810c5ac.
[D 11:32:21.004719] BMI_cancel: cancel id 42
* BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */
[2] - segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
[E 11:32:21.993383] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 53.
[D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 54, ptr: 0x810c5ac.
[D 11:32:21.993439] BMI_cancel: cancel id 54
* BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */
[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
Since the receives are pending (posted, not completed), the
cancel should succeed. It is probably a bug in my code that
causes the segfaults. Unfortunately, the core files are not
usable:
% gdb pvfs2-cp core.28788
"core.28788" is not a core dump: File format not recognized
The file command disagrees:
% file core.28788
core.28788: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
If the other BMI methods do not hang, then I need to keep digging.
Scott
On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:
Hi all,
I am simply running two pvfs2-cp processes at the same time to
see how everything works. For some reason, the copies start
but do not finish. Eventually, I see timeouts in BMI but not
in bmi_mx. Before I spend too much time on this, can other
methods run two copies at the same time?
$ for I in 1 2 ; do
pvfs2-cp test /mnt/pvfs2/test-${I} &
done
Thanks,
Scott
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers