Thanks!

Becky
On Tue, Jul 19, 2011 at 5:35 PM, Mi Zhou <[email protected]> wrote:
>
> > Another question: when pvfs2-cp failed and you got the timing
> > messages on the client, did you also get timing messages on any of the
> > servers at about the same time?
>
> These are the errors on the server when I tried copying files. BTW, it happens
> not only from NFS to PVFS, but also local to/from PVFS and PVFS to/from
> PVFS.
>
> [E 07/19/2011 11:19:29] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 770776.
> [E 07/19/2011 11:19:29] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac008020
> [E 07/19/2011 11:19:29] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:19:29] handle_io_error: flow proto error cleanup started on 0x2aaaac008020: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:19:29] handle_io_error: flow proto 0x2aaaac008020 canceled 1 operations, will clean up.
> [E 07/19/2011 11:19:29] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:19:29] handle_io_error: flow proto 0x2aaaac008020 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:31:40] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 772303.
> [E 07/19/2011 11:31:40] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac005ea0
> [E 07/19/2011 11:31:40] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:31:40] handle_io_error: flow proto error cleanup started on 0x2aaaac005ea0: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:31:40] handle_io_error: flow proto 0x2aaaac005ea0 canceled 1 operations, will clean up.
> [E 07/19/2011 11:31:40] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:31:40] handle_io_error: flow proto 0x2aaaac005ea0 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:33:13] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 772529.
> [E 07/19/2011 11:33:13] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac2792a0
> [E 07/19/2011 11:33:13] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:33:13] handle_io_error: flow proto error cleanup started on 0x2aaaac2792a0: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:33:13] handle_io_error: flow proto 0x2aaaac2792a0 canceled 1 operations, will clean up.
> [E 07/19/2011 11:33:13] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:33:13] handle_io_error: flow proto 0x2aaaac2792a0 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:47:29] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 774732.
> [E 07/19/2011 11:47:29] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac043410
> [E 07/19/2011 11:47:29] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:47:29] handle_io_error: flow proto error cleanup started on 0x2aaaac043410: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:47:29] handle_io_error: flow proto 0x2aaaac043410 canceled 1 operations, will clean up.
> [E 07/19/2011 11:47:29] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:47:29] handle_io_error: flow proto 0x2aaaac043410 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:55:43] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 775375.
> [E 07/19/2011 11:55:43] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac279050
> [E 07/19/2011 11:55:43] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:55:43] handle_io_error: flow proto error cleanup started on 0x2aaaac279050: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:55:43] handle_io_error: flow proto 0x2aaaac279050 canceled 1 operations, will clean up.
> [E 07/19/2011 11:55:43] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:55:43] handle_io_error: flow proto 0x2aaaac279050 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 12:01:32] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 778783.
> [E 07/19/2011 12:01:32] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac0070a0
> [E 07/19/2011 12:01:32] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 12:01:32] handle_io_error: flow proto error cleanup started on 0x2aaaac0070a0: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 12:01:32] handle_io_error: flow proto 0x2aaaac0070a0 canceled 1 operations, will clean up.
> [E 07/19/2011 12:01:32] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 12:01:32] handle_io_error: flow proto 0x2aaaac0070a0 error cleanup finished: Operation cancelled (possibly due to timeout)
>
> Thanks,
> Mi
>
> > Becky
> >
> > On Tue, Jul 19, 2011 at 4:40 PM, Mi Zhou <[email protected]> wrote:
> > Hi Becky,
> >
> > Now it does not time out, but one of the pvfs2 server nodes crashed in
> > the middle of the copy:
> >
> > [D 07/19/2011 15:31:54] ib_check_cq: recv from 172.20.101.34:45263 len 16 type MSG_RTS_DONE credit 1.
> > [D 07/19/2011 15:31:54] encourage_recv_incoming: recv RTS_DONE mop_id 11957200.
> > [E 07/19/2011 15:31:54] Error: encourage_recv_incoming: mop_id 11957200 in RTS_DONE message not found.
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d6a]
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f60]
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448aa9]
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445293]
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b77c]
> > [E 07/19/2011 15:31:54] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> > [E 07/19/2011 15:31:54] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> >
> > Thanks,
> > Mi
> >
> > On Tue, 2011-07-19 at 12:47 -0500, Becky Ligon wrote:
> > > Mi:
> > >
> > > In your configuration file set the following:
> > >
> > > <Defaults>
> > > ServerJobFlowTimeoutSecs 600
> > > ClientJobFlowTimeoutSecs 600
> > >
> > > ServerJobBMITimeoutSecs 600
> > > ClientJobBMITimeoutSecs 600
> > > </Defaults>
> > >
> > > Normally, these timeouts are 300 seconds (5 minutes). See if this
> > > helps with the NFS-to-PVFS 75GB copy.
> > >
> > > I will also check into why pvfs2-cp issued that assert. Most likely,
> > > the code isn't handling error conditions properly.
> > >
> > > Becky
> > >
> > > On Tue, Jul 19, 2011 at 1:33 PM, Becky Ligon <[email protected]> wrote:
> > > Mi:
> > >
> > > I believe you need to increase the job timer configuration
> > > option. Give me a few minutes and I'll send you the exact information.
> > >
> > > If you can avoid using NFS and copy directly from the physical
> > > source, your copy will execute much quicker.
> > >
> > > Becky
> > >
> > > On Tue, Jul 19, 2011 at 12:38 PM, Mi Zhou <[email protected]> wrote:
> > > Hi,
> > >
> > > I tried to "pvfs2-cp" a 75G file from NFS to PVFS; it stalled at 1.5G and
> > > after a while I got this error:
> > >
> > > [E 11:28:15.521435] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 2020.
> > > [E 11:28:15.521578] fp_multiqueue_cancel: flow proto cancel called on 0x1468bc78
> > > [E 11:28:15.521587] fp_multiqueue_cancel: I/O error occurred
> > > [E 11:28:15.521616] handle_io_error: flow proto error cleanup started on 0x1468bc78: Operation cancelled (possibly due to timeout)
> > > [E 11:28:15.521667] handle_io_error: flow proto 0x1468bc78 canceled 1 operations, will clean up.
> > > [E 11:28:15.522059] mem_to_bmi_callback_fn: I/O error occurred
> > > [E 11:28:15.522124] handle_io_error: flow proto 0x1468bc78 error cleanup finished: Operation cancelled (possibly due to timeout)
> > > pvfs2-cp: src/client/sysint/sys-io.sm:1423: io_datafile_complete_operations: Assertion `cur_ctx->write_ack.recv_status.actual_size <= cur_ctx->write_ack.max_resp_sz' failed.
> > > Aborted
> > >
> > > Any advice is very much appreciated.
> > >
> > > Thanks,
> > >
> > > --
> > > Mi Zhou
> > > System Integration Engineer
> > > Information Sciences
> > > St. Jude Children's Research Hospital
> > > 262 Danny Thomas Pl. MS 312
> > > Memphis, TN 38105
> > > 901.595.5771
> > >
> > > Email Disclaimer: www.stjude.org/emaildisclaimer
> > >
> > > _______________________________________________
> > > Pvfs2-users mailing list
> > > [email protected]
> > > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> > >
> > > --
> > > Becky Ligon
> > > OrangeFS Support and Development
> > > Omnibond Systems
> > > Anderson, South Carolina
> > >
> > > --
> > > Becky Ligon
> > > OrangeFS Support and Development
> > > Omnibond Systems
> > > Anderson, South Carolina
> >
> > --
> > Mi Zhou
> > System Integration Engineer
> > Information Sciences
> > St. Jude Children's Research Hospital
> > 262 Danny Thomas Pl. MS 312
> > Memphis, TN 38105
> > 901.595.5771
> >
> > --
> > Becky Ligon
> > OrangeFS Support and Development
> > Omnibond Systems
> > Anderson, South Carolina
>
> --
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771

--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
