Thanks!

Becky
On Tue, Jul 19, 2011 at 5:35 PM, Mi Zhou <[email protected]> wrote:
>
> > Another question: when pvfs2-cp failed and you got the timing
> > messages on the client, did you also get timing messages on any of the
> > servers at about the same time?
>
> These are the errors on the server when I tried copying files. BTW, it happens
> not only from NFS to PVFS, but also local to/from PVFS and PVFS to/from
> PVFS.
>
> [E 07/19/2011 11:19:29] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 770776.
> [E 07/19/2011 11:19:29] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac008020
> [E 07/19/2011 11:19:29] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:19:29] handle_io_error: flow proto error cleanup started on 0x2aaaac008020: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:19:29] handle_io_error: flow proto 0x2aaaac008020 canceled 1 operations, will clean up.
> [E 07/19/2011 11:19:29] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:19:29] handle_io_error: flow proto 0x2aaaac008020 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:31:40] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 772303.
> [E 07/19/2011 11:31:40] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac005ea0
> [E 07/19/2011 11:31:40] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:31:40] handle_io_error: flow proto error cleanup started on 0x2aaaac005ea0: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:31:40] handle_io_error: flow proto 0x2aaaac005ea0 canceled 1 operations, will clean up.
> [E 07/19/2011 11:31:40] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:31:40] handle_io_error: flow proto 0x2aaaac005ea0 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:33:13] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 772529.
> [E 07/19/2011 11:33:13] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac2792a0
> [E 07/19/2011 11:33:13] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:33:13] handle_io_error: flow proto error cleanup started on 0x2aaaac2792a0: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:33:13] handle_io_error: flow proto 0x2aaaac2792a0 canceled 1 operations, will clean up.
> [E 07/19/2011 11:33:13] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:33:13] handle_io_error: flow proto 0x2aaaac2792a0 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:47:29] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 774732.
> [E 07/19/2011 11:47:29] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac043410
> [E 07/19/2011 11:47:29] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:47:29] handle_io_error: flow proto error cleanup started on 0x2aaaac043410: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:47:29] handle_io_error: flow proto 0x2aaaac043410 canceled 1 operations, will clean up.
> [E 07/19/2011 11:47:29] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:47:29] handle_io_error: flow proto 0x2aaaac043410 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:55:43] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 775375.
> [E 07/19/2011 11:55:43] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac279050
> [E 07/19/2011 11:55:43] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 11:55:43] handle_io_error: flow proto error cleanup started on 0x2aaaac279050: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 11:55:43] handle_io_error: flow proto 0x2aaaac279050 canceled 1 operations, will clean up.
> [E 07/19/2011 11:55:43] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 11:55:43] handle_io_error: flow proto 0x2aaaac279050 error cleanup finished: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 12:01:32] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 778783.
> [E 07/19/2011 12:01:32] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac0070a0
> [E 07/19/2011 12:01:32] fp_multiqueue_cancel: I/O error occurred
> [E 07/19/2011 12:01:32] handle_io_error: flow proto error cleanup started on 0x2aaaac0070a0: Operation cancelled (possibly due to timeout)
> [E 07/19/2011 12:01:32] handle_io_error: flow proto 0x2aaaac0070a0 canceled 1 operations, will clean up.
> [E 07/19/2011 12:01:32] bmi_recv_callback_fn: I/O error occurred
> [E 07/19/2011 12:01:32] handle_io_error: flow proto 0x2aaaac0070a0 error cleanup finished: Operation cancelled (possibly due to timeout)
>
> Thanks,
> Mi
>
> > Becky
> >
> > On Tue, Jul 19, 2011 at 4:40 PM, Mi Zhou <[email protected]> wrote:
> > Hi Becky,
> >
> > Now it does not time out, but one of the pvfs2 server nodes crashed in
> > the middle of the copy:
> >
> > [D 07/19/2011 15:31:54] ib_check_cq: recv from 172.20.101.34:45263 len 16 type MSG_RTS_DONE credit 1.
> > [D 07/19/2011 15:31:54] encourage_recv_incoming: recv RTS_DONE mop_id 11957200.
> > [E 07/19/2011 15:31:54] Error: encourage_recv_incoming: mop_id 11957200 in RTS_DONE message not found.
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d6a]
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f60]
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448aa9]
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445293]
> > [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b77c]
> > [E 07/19/2011 15:31:54] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> > [E 07/19/2011 15:31:54] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
> >
> > Thanks,
> > Mi
> >
> > On Tue, 2011-07-19 at 12:47 -0500, Becky Ligon wrote:
> > > Mi:
> > >
> > > In your configuration file set the following:
> > >
> > > <Defaults>
> > > ServerJobFlowTimeoutSecs 600
> > > ClientJobFlowTimeoutSecs 600
> > >
> > > ServerJobBMITimeoutSecs 600
> > > ClientJobBMITimeoutSecs 600
> > > </Defaults>
> > >
> > > Normally, these timeouts are 300 seconds (5 minutes). See if this
> > > helps with the NFS-to-PVFS 75GB copy.
> > >
> > > I will also check into why pvfs2-cp issued that assert. Most likely,
> > > the code isn't handling error conditions properly.
> > >
> > > Becky
> > >
> > > On Tue, Jul 19, 2011 at 1:33 PM, Becky Ligon <[email protected]> wrote:
> > > Mi:
> > >
> > > I believe you need to increase the job timer configuration
> > > option. Give me a few minutes and I'll send you the exact information.
> > >
> > > If you can avoid using NFS and copy directly from the physical
> > > source, your copy will execute much quicker.
> > >
> > > Becky
> > >
> > > On Tue, Jul 19, 2011 at 12:38 PM, Mi Zhou <[email protected]> wrote:
> > > Hi,
> > >
> > > I tried to "pvfs2-cp" a 75G file from NFS to PVFS; it stalled at 1.5G and
> > > after a while I got this error:
> > >
> > > [E 11:28:15.521435] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 2020.
> > > [E 11:28:15.521578] fp_multiqueue_cancel: flow proto cancel called on 0x1468bc78
> > > [E 11:28:15.521587] fp_multiqueue_cancel: I/O error occurred
> > > [E 11:28:15.521616] handle_io_error: flow proto error cleanup started on 0x1468bc78: Operation cancelled (possibly due to timeout)
> > > [E 11:28:15.521667] handle_io_error: flow proto 0x1468bc78 canceled 1 operations, will clean up.
> > > [E 11:28:15.522059] mem_to_bmi_callback_fn: I/O error occurred
> > > [E 11:28:15.522124] handle_io_error: flow proto 0x1468bc78 error cleanup finished: Operation cancelled (possibly due to timeout)
> > > pvfs2-cp: src/client/sysint/sys-io.sm:1423: io_datafile_complete_operations: Assertion `cur_ctx->write_ack.recv_status.actual_size <= cur_ctx->write_ack.max_resp_sz' failed.
> > > Aborted
> > >
> > > Any advice is very much appreciated.
> > >
> > > Thanks,
> > >
> > > --
> > > Mi Zhou
> > > System Integration Engineer
> > > Information Sciences
> > > St. Jude Children's Research Hospital
> > > 262 Danny Thomas Pl. MS 312
> > > Memphis, TN 38105
> > > 901.595.5771
> > >
> > > Email Disclaimer: www.stjude.org/emaildisclaimer
> > >
> > > _______________________________________________
> > > Pvfs2-users mailing list
> > > [email protected]
> > > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> > >
> > > --
> > > Becky Ligon
> > > OrangeFS Support and Development
> > > Omnibond Systems
> > > Anderson, South Carolina
> > >
> > > --
> > > Becky Ligon
> > > OrangeFS Support and Development
> > > Omnibond Systems
> > > Anderson, South Carolina
> >
> > --
> > Mi Zhou
> > System Integration Engineer
> > Information Sciences
> > St. Jude Children's Research Hospital
> > 262 Danny Thomas Pl. MS 312
> > Memphis, TN 38105
> > 901.595.5771
> >
> > --
> > Becky Ligon
> > OrangeFS Support and Development
> > Omnibond Systems
> > Anderson, South Carolina
>
> --
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771

--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
