Hi Becky,

Now it does not time out, but one of the pvfs2 server nodes crashed in the middle of the copy:
[D 07/19/2011 15:31:54] ib_check_cq: recv from 172.20.101.34:45263 len 16 type MSG_RTS_DONE credit 1
[D 07/19/2011 15:31:54] encourage_recv_incoming: recv RTS_DONE mop_id 11957200
[E 07/19/2011 15:31:54] Error: encourage_recv_incoming: mop_id 11957200 in RTS_DONE message not found
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d6a]
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f60]
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448aa9]
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445293]
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b77c]
[E 07/19/2011 15:31:54] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
[E 07/19/2011 15:31:54] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]

Thanks,
Mi

On Tue, 2011-07-19 at 12:47 -0500, Becky Ligon wrote:
> Mi:
>
> In your configuration file, set the following:
>
> <Defaults>
>     ServerJobFlowTimeoutSecs 600
>     ClientJobFlowTimeoutSecs 600
>     ServerJobBMITimeoutSecs 600
>     ClientJobBMITimeoutSecs 600
> </Defaults>
>
> Normally, these timeouts are 300 seconds (5 minutes). See if this
> helps with the NFS-to-PVFS 75GB copy.
>
> I will also check into why pvfs2-cp issued that assert. Most likely,
> the code isn't handling error conditions properly.
>
> Becky
>
> On Tue, Jul 19, 2011 at 1:33 PM, Becky Ligon <[email protected]> wrote:
> Mi:
>
> I believe you need to increase the job timer configuration
> option. Give me a few minutes and I'll send you the exact
> information.
>
> If you can avoid using NFS and copy directly from the physical
> source, your copy will execute much quicker.
>
> Becky
>
> On Tue, Jul 19, 2011 at 12:38 PM, Mi Zhou <[email protected]> wrote:
> Hi,
>
> I tried to "pvfs2-cp" a 75G file from NFS to PVFS; it stalled at 1.5G,
> and after a while I got this error:
>
> [E 11:28:15.521435] job_time_mgr_expire: job time out: cancelling flow
> operation, job_id: 2020.
> [E 11:28:15.521578] fp_multiqueue_cancel: flow proto cancel called on
> 0x1468bc78
> [E 11:28:15.521587] fp_multiqueue_cancel: I/O error occurred
> [E 11:28:15.521616] handle_io_error: flow proto error cleanup started on
> 0x1468bc78: Operation cancelled (possibly due to timeout)
> [E 11:28:15.521667] handle_io_error: flow proto 0x1468bc78 canceled 1
> operations, will clean up.
> [E 11:28:15.522059] mem_to_bmi_callback_fn: I/O error occurred
> [E 11:28:15.522124] handle_io_error: flow proto 0x1468bc78 error cleanup
> finished: Operation cancelled (possibly due to timeout)
> pvfs2-cp: src/client/sysint/sys-io.sm:1423:
> io_datafile_complete_operations: Assertion
> `cur_ctx->write_ack.recv_status.actual_size <=
> cur_ctx->write_ack.max_resp_sz' failed.
> Aborted
>
> Any advice is very much appreciated.
>
> Thanks,
>
> --
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> Pvfs2-users mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>
> --
> Becky Ligon
> OrangeFS Support and Development
> Omnibond Systems
> Anderson, South Carolina

--
Mi Zhou
System Integration Engineer
Information Sciences
St. Jude Children's Research Hospital
262 Danny Thomas Pl. MS 312
Memphis, TN 38105
901.595.5771

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
