Let me take a look at the BMI-ib module and see if I can figure out why you are getting these error messages.
Another question: when pvfs2-cp failed and you got the timing messages on the client, did you also get timing messages on any of the servers at about the same time?

Becky

On Tue, Jul 19, 2011 at 4:40 PM, Mi Zhou <[email protected]> wrote:
> Hi Becky,
>
> Now it does not time out but one of the pvfs2 server nodes crashed in
> the middle of the copy:
>
> [D 07/19/2011 15:31:54] ib_check_cq: recv from 172.20.101.34:45263 len 16 type MSG_RTS_DONE credit 1.
> [D 07/19/2011 15:31:54] encourage_recv_incoming: recv RTS_DONE mop_id 11957200.
> [E 07/19/2011 15:31:54] Error: encourage_recv_incoming: mop_id 11957200 in RTS_DONE message not found.
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d6a]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f60]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448aa9]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445293]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b77c]
> [E 07/19/2011 15:31:54] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> [E 07/19/2011 15:31:54] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>
> Thanks,
>
> Mi
>
> On Tue, 2011-07-19 at 12:47 -0500, Becky Ligon wrote:
> > Mi:
> >
> > In your configuration file set the following:
> >
> > <Defaults>
> > ServerJobFlowTimeoutSecs 600
> > ClientJobFlowTimeoutSecs 600
> >
> > ServerJobBMITimeoutSecs 600
> > ClientJobBMITimeoutSecs 600
> > </Defaults>
> >
> > Normally, these timeouts are 300 seconds (5 minutes). See if this
> > helps with the NFS-to-PVFS 75GB copy.
> >
> > I will also check into why pvfs2-cp issued that assert. Most likely,
> > the code isn't handling error conditions properly.
> >
> > Becky
> >
> > On Tue, Jul 19, 2011 at 1:33 PM, Becky Ligon <[email protected]> wrote:
> > Mi:
> >
> > I believe you need to increase the job timer configuration
> > option. Give me a few minutes and I'll send you the exact
> > information.
> >
> > If you can avoid using NFS and copy directly from the physical
> > source, your copy will execute much quicker.
> >
> > Becky
> >
> > On Tue, Jul 19, 2011 at 12:38 PM, Mi Zhou <[email protected]> wrote:
> > Hi,
> >
> > I tried to "pvfs-cp" a 75G file from NFS to PVFS, it stalled at 1.5G and
> > after a while I got this error:
> >
> > [E 11:28:15.521435] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 2020.
> > [E 11:28:15.521578] fp_multiqueue_cancel: flow proto cancel called on 0x1468bc78
> > [E 11:28:15.521587] fp_multiqueue_cancel: I/O error occurred
> > [E 11:28:15.521616] handle_io_error: flow proto error cleanup started on 0x1468bc78: Operation cancelled (possibly due to timeout)
> > [E 11:28:15.521667] handle_io_error: flow proto 0x1468bc78 canceled 1 operations, will clean up.
> > [E 11:28:15.522059] mem_to_bmi_callback_fn: I/O error occurred
> > [E 11:28:15.522124] handle_io_error: flow proto 0x1468bc78 error cleanup finished: Operation cancelled (possibly due to timeout)
> > pvfs2-cp: src/client/sysint/sys-io.sm:1423: io_datafile_complete_operations: Assertion `cur_ctx->write_ack.recv_status.actual_size <= cur_ctx->write_ack.max_resp_sz' failed.
> > Aborted
> >
> > Any advice is very much appreciated.
> >
> > Thanks,
> >
> > --
> > Mi Zhou
> > System Integration Engineer
> > Information Sciences
> > St. Jude Children's Research Hospital
> > 262 Danny Thomas Pl. MS 312
> > Memphis, TN 38105
> > 901.595.5771
> >
> > Email Disclaimer: www.stjude.org/emaildisclaimer
> >
> > _______________________________________________
> > Pvfs2-users mailing list
> > [email protected]
> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> >
> > --
> > Becky Ligon
> > OrangeFS Support and Development
> > Omnibond Systems
> > Anderson, South Carolina
> >
> > --
> > Becky Ligon
> > OrangeFS Support and Development
> > Omnibond Systems
> > Anderson, South Carolina
>
> --
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771

--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
