> Another question: when pvfs2-cp failed and you got the timing
> messages on the client, did you also get timing messages on any of the
> servers at about the same time?
These are the errors on the server when I tried copying files. BTW, it happens not only from NFS to PVFS, but also local to/from PVFS and PVFS to/from PVFS.

[E 07/19/2011 11:19:29] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 770776.
[E 07/19/2011 11:19:29] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac008020
[E 07/19/2011 11:19:29] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:19:29] handle_io_error: flow proto error cleanup started on 0x2aaaac008020: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:19:29] handle_io_error: flow proto 0x2aaaac008020 canceled 1 operations, will clean up.
[E 07/19/2011 11:19:29] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:19:29] handle_io_error: flow proto 0x2aaaac008020 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:31:40] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 772303.
[E 07/19/2011 11:31:40] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac005ea0
[E 07/19/2011 11:31:40] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:31:40] handle_io_error: flow proto error cleanup started on 0x2aaaac005ea0: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:31:40] handle_io_error: flow proto 0x2aaaac005ea0 canceled 1 operations, will clean up.
[E 07/19/2011 11:31:40] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:31:40] handle_io_error: flow proto 0x2aaaac005ea0 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:33:13] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 772529.
[E 07/19/2011 11:33:13] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac2792a0
[E 07/19/2011 11:33:13] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:33:13] handle_io_error: flow proto error cleanup started on 0x2aaaac2792a0: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:33:13] handle_io_error: flow proto 0x2aaaac2792a0 canceled 1 operations, will clean up.
[E 07/19/2011 11:33:13] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:33:13] handle_io_error: flow proto 0x2aaaac2792a0 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:47:29] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 774732.
[E 07/19/2011 11:47:29] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac043410
[E 07/19/2011 11:47:29] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:47:29] handle_io_error: flow proto error cleanup started on 0x2aaaac043410: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:47:29] handle_io_error: flow proto 0x2aaaac043410 canceled 1 operations, will clean up.
[E 07/19/2011 11:47:29] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:47:29] handle_io_error: flow proto 0x2aaaac043410 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:55:43] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 775375.
[E 07/19/2011 11:55:43] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac279050
[E 07/19/2011 11:55:43] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 11:55:43] handle_io_error: flow proto error cleanup started on 0x2aaaac279050: Operation cancelled (possibly due to timeout)
[E 07/19/2011 11:55:43] handle_io_error: flow proto 0x2aaaac279050 canceled 1 operations, will clean up.
[E 07/19/2011 11:55:43] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 11:55:43] handle_io_error: flow proto 0x2aaaac279050 error cleanup finished: Operation cancelled (possibly due to timeout)
[E 07/19/2011 12:01:32] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 778783.
[E 07/19/2011 12:01:32] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac0070a0
[E 07/19/2011 12:01:32] fp_multiqueue_cancel: I/O error occurred
[E 07/19/2011 12:01:32] handle_io_error: flow proto error cleanup started on 0x2aaaac0070a0: Operation cancelled (possibly due to timeout)
[E 07/19/2011 12:01:32] handle_io_error: flow proto 0x2aaaac0070a0 canceled 1 operations, will clean up.
[E 07/19/2011 12:01:32] bmi_recv_callback_fn: I/O error occurred
[E 07/19/2011 12:01:32] handle_io_error: flow proto 0x2aaaac0070a0 error cleanup finished: Operation cancelled (possibly due to timeout)

Thanks,
Mi

> Becky
>
> On Tue, Jul 19, 2011 at 4:40 PM, Mi Zhou <[email protected]> wrote:
> Hi Becky,
>
> Now it does not time out but one of the pvfs2 server nodes crashed in
> the middle of the copy:
>
> [D 07/19/2011 15:31:54] ib_check_cq: recv from 172.20.101.34:45263 len 16 type MSG_RTS_DONE credit 1.
> [D 07/19/2011 15:31:54] encourage_recv_incoming: recv RTS_DONE mop_id 11957200.
> [E 07/19/2011 15:31:54] Error: encourage_recv_incoming: mop_id 11957200 in RTS_DONE message not found.
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d6a]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f60]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448aa9]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445293]
> [E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b77c]
> [E 07/19/2011 15:31:54] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> [E 07/19/2011 15:31:54] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>
> Thanks,
> Mi
>
> On Tue, 2011-07-19 at 12:47 -0500, Becky Ligon wrote:
> > Mi:
> >
> > In your configuration file set the following:
> >
> > <Defaults>
> >     ServerJobFlowTimeoutSecs 600
> >     ClientJobFlowTimeoutSecs 600
> >
> >     ServerJobBMITimeoutSecs 600
> >     ClientJobBMITimeoutSecs 600
> > </Defaults>
> >
> > Normally, these timeouts are 300 seconds (5 minutes). See if this
> > helps with the NFS-to-PVFS 75GB copy.
> >
> > I will also check into why pvfs2-cp issued that assert. Most likely,
> > the code isn't handling error conditions properly.
> >
> > Becky
> >
> > On Tue, Jul 19, 2011 at 1:33 PM, Becky Ligon <[email protected]> wrote:
> > > Mi:
> > >
> > > I believe you need to increase the job timer configuration
> > > option. Give me a few minutes and I'll send you the exact
> > > information.
> > >
> > > If you can avoid using NFS and copy directly from the physical
> > > source, your copy will execute much quicker.
> > >
> > > Becky
> > >
> > > On Tue, Jul 19, 2011 at 12:38 PM, Mi Zhou <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I tried to "pvfs2-cp" a 75G file from NFS to PVFS; it stalled at 1.5G, and
> > > > after a while I got this error:
> > > >
> > > > [E 11:28:15.521435] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 2020.
> > > > [E 11:28:15.521578] fp_multiqueue_cancel: flow proto cancel called on 0x1468bc78
> > > > [E 11:28:15.521587] fp_multiqueue_cancel: I/O error occurred
> > > > [E 11:28:15.521616] handle_io_error: flow proto error cleanup started on 0x1468bc78: Operation cancelled (possibly due to timeout)
> > > > [E 11:28:15.521667] handle_io_error: flow proto 0x1468bc78 canceled 1 operations, will clean up.
> > > > [E 11:28:15.522059] mem_to_bmi_callback_fn: I/O error occurred
> > > > [E 11:28:15.522124] handle_io_error: flow proto 0x1468bc78 error cleanup finished: Operation cancelled (possibly due to timeout)
> > > > pvfs2-cp: src/client/sysint/sys-io.sm:1423: io_datafile_complete_operations: Assertion `cur_ctx->write_ack.recv_status.actual_size <= cur_ctx->write_ack.max_resp_sz' failed.
> > > > Aborted
> > > >
> > > > Any advice is very much appreciated.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Mi Zhou
> > > > System Integration Engineer
> > > > Information Sciences
> > > > St. Jude Children's Research Hospital
> > > > 262 Danny Thomas Pl. MS 312
> > > > Memphis, TN 38105
> > > > 901.595.5771
> > > >
> > > > Email Disclaimer: www.stjude.org/emaildisclaimer
> > >
> > > --
> > > Becky Ligon
> > > OrangeFS Support and Development
> > > Omnibond Systems
> > > Anderson, South Carolina

--
Mi Zhou
System Integration Engineer
Information Sciences
St. Jude Children's Research Hospital
262 Danny Thomas Pl. MS 312
Memphis, TN 38105
901.595.5771

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
