> Another question: when pvfs2-cp failed and you got the timing
> messages on the client, did you also get timing messages on any of the
> servers at about the same time?
These are the errors on the server when I tried copying files. BTW, it happens not only from NFS to PVFS, but also local to/from PVFS and PVFS to/from PVFS.

[E 07/19/2011 11:19:29] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 770776.
[E 07/19/2011 11:19:29] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac008020
[E 07/19/2011 11:19:29] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:19:29] handle_io_error: flow proto error cleanup started on 0x2aaaac008020: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:19:29] handle_io_error: flow proto 0x2aaaac008020 canceled 1 operations, will clean up.
[E 07/19/2011 11:19:29] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:19:29] handle_io_error: flow proto 0x2aaaac008020 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:31:40] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 772303.
[E 07/19/2011 11:31:40] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac005ea0
[E 07/19/2011 11:31:40] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:31:40] handle_io_error: flow proto error cleanup started on 0x2aaaac005ea0: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:31:40] handle_io_error: flow proto 0x2aaaac005ea0 canceled 1 operations, will clean up.
[E 07/19/2011 11:31:40] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:31:40] handle_io_error: flow proto 0x2aaaac005ea0 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:33:13] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 772529.
[E 07/19/2011 11:33:13] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac2792a0
[E 07/19/2011 11:33:13] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:33:13] handle_io_error: flow proto error cleanup started on 0x2aaaac2792a0: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:33:13] handle_io_error: flow proto 0x2aaaac2792a0 canceled 1 operations, will clean up.
[E 07/19/2011 11:33:13] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:33:13] handle_io_error: flow proto 0x2aaaac2792a0 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:47:29] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 774732.
[E 07/19/2011 11:47:29] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac043410
[E 07/19/2011 11:47:29] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:47:29] handle_io_error: flow proto error cleanup started on 0x2aaaac043410: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:47:29] handle_io_error: flow proto 0x2aaaac043410 canceled 1 operations, will clean up.
[E 07/19/2011 11:47:29] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:47:29] handle_io_error: flow proto 0x2aaaac043410 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:55:43] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 775375.
[E 07/19/2011 11:55:43] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac279050
[E 07/19/2011 11:55:43] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:55:43] handle_io_error: flow proto error cleanup started on 0x2aaaac279050: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:55:43] handle_io_error: flow proto 0x2aaaac279050 canceled 1 operations, will clean up.
[E 07/19/2011 11:55:43] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:55:43] handle_io_error: flow proto 0x2aaaac279050 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 12:01:32] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 778783.
[E 07/19/2011 12:01:32] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac0070a0
[E 07/19/2011 12:01:32] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 12:01:32] handle_io_error: flow proto error cleanup started on 0x2aaaac0070a0: Operation cancelled (possibly due to timeout)
[E 07/19/2011 12:01:32] handle_io_error: flow proto 0x2aaaac0070a0 canceled 1 operations, will clean up.
[E 07/19/2011 12:01:32] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 12:01:32] handle_io_error: flow proto 0x2aaaac0070a0 error cleanup finished: Operation cancelled (possibly due to timeout)

Thanks,
Mi

> Becky
>
> On Tue, Jul 19, 2011 at 4:40 PM, Mi Zhou <[email protected]> wrote:
> Hi Becky,
>
> Now it does not time out but one of the pvfs2 server nodes crashed in
> the middle of the copy:
>
> [D 07/19/2011 15:31:54] ib_check_cq: recv from 172.20.101.34:45263 len 16 type MSG_RTS_DONE credit 1.
> [D 07/19/2011 15:31:54] encourage_recv_incoming: recv RTS_DONE mop_id 11957200.
> [E 07/19/2011 15:31:54] Error: encourage_recv_incoming: mop_id 11957200 in RTS_DONE message not found.
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d6a]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f60]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448aa9]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445293]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b77c]
> [E 07/19/2011 15:31:54] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> [E 07/19/2011 15:31:54] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>
> Thanks,
> Mi
>
> On Tue, 2011-07-19 at 12:47 -0500, Becky Ligon wrote:
> > Mi:
> >
> > In your configuration file set the following:
> >
> > <Defaults>
> >     ServerJobFlowTimeoutSecs 600
> >     ClientJobFlowTimeoutSecs 600
> >
> >     ServerJobBMITimeoutSecs 600
> >     ClientJobBMITimeoutSecs 600
> > </Defaults>
> >
> > Normally, these timeouts are 300 seconds (5 minutes). See if this
> > helps with the NFS-to-PVFS 75GB copy.
> >
> > I will also check into why pvfs2-cp issued that assert. Most likely,
> > the code isn't handling error conditions properly.
> >
> > Becky
> >
> > On Tue, Jul 19, 2011 at 1:33 PM, Becky Ligon <[email protected]> wrote:
> > > Mi:
> > >
> > > I believe you need to increase the job timer configuration
> > > option. Give me a few minutes and I'll send you the exact
> > > information.
> > >
> > > If you can avoid using NFS and copy directly from the physical
> > > source, your copy will execute much quicker.
> > >
> > > Becky
> > >
> > > On Tue, Jul 19, 2011 at 12:38 PM, Mi Zhou <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I tried to "pvfs2-cp" a 75G file from NFS to PVFS; it stalled at 1.5G, and
> > > > after a while I got this error:
> > > >
> > > > [E 11:28:15.521435] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 2020.
> > > > [E 11:28:15.521578] fp_multiqueue_cancel: flow proto cancel called on 0x1468bc78
> > > > [E 11:28:15.521587] fp_multiqueue_cancel: I/O error occurred
> > > > [E 11:28:15.521616] handle_io_error: flow proto error cleanup started on 0x1468bc78: Operation cancelled (possibly due to timeout)
> > > > [E 11:28:15.521667] handle_io_error: flow proto 0x1468bc78 canceled 1 operations, will clean up.
> > > > [E 11:28:15.522059] mem_to_bmi_callback_fn: I/O error occurred
> > > > [E 11:28:15.522124] handle_io_error: flow proto 0x1468bc78 error cleanup finished: Operation cancelled (possibly due to timeout)
> > > > pvfs2-cp: src/client/sysint/sys-io.sm:1423: io_datafile_complete_operations: Assertion `cur_ctx->write_ack.recv_status.actual_size <= cur_ctx->write_ack.max_resp_sz' failed.
> > > > Aborted
> > > >
> > > > Any advice is very much appreciated.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Mi Zhou
> > > > System Integration Engineer
> > > > Information Sciences
> > > > St. Jude Children's Research Hospital
> > > > 262 Danny Thomas Pl. MS 312
> > > > Memphis, TN 38105
> > > > 901.595.5771
> > > >
> > > > Email Disclaimer: www.stjude.org/emaildisclaimer
> > >
> > > --
> > > Becky Ligon
> > > OrangeFS Support and Development
> > > Omnibond Systems
> > > Anderson, South Carolina

--
Mi Zhou
System Integration Engineer
Information Sciences
St. Jude Children's Research Hospital
262 Danny Thomas Pl. MS 312
Memphis, TN 38105
901.595.5771

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
