Hi Becky,

Now it does not time out, but one of the pvfs2 server nodes crashed in the middle of the copy:
[D 07/19/2011 15:31:54] ib_check_cq: recv from 172.20.101.34:45263 len 16 type MSG_RTS_DONE credit 1
[D 07/19/2011 15:31:54] encourage_recv_incoming: recv RTS_DONE mop_id 11957200
[E 07/19/2011 15:31:54] Error: encourage_recv_incoming: mop_id 11957200 in RTS_DONE message not found
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d6a]
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f60]
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448aa9]
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445293]
[E 07/19/2011 15:31:54] [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b77c]
[E 07/19/2011 15:31:54] [bt] /lib64/libpthread.so.0 [0x3ba820673d]
[E 07/19/2011 15:31:54] [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]

Thanks,
Mi

On Tue, 2011-07-19 at 12:47 -0500, Becky Ligon wrote:
> Mi:
>
> In your configuration file, set the following:
>
> <Defaults>
>     ServerJobFlowTimeoutSecs 600
>     ClientJobFlowTimeoutSecs 600
>     ServerJobBMITimeoutSecs 600
>     ClientJobBMITimeoutSecs 600
> </Defaults>
>
> Normally, these timeouts are 300 seconds (5 minutes). See if this
> helps with the NFS-to-PVFS 75GB copy.
>
> I will also check into why pvfs2-cp issued that assert. Most likely,
> the code isn't handling error conditions properly.
>
> Becky
>
> On Tue, Jul 19, 2011 at 1:33 PM, Becky Ligon <[email protected]> wrote:
> Mi:
>
> I believe you need to increase the job timer configuration
> option. Give me a few minutes and I'll send you the exact
> information.
>
> If you can avoid using NFS and copy directly from the physical
> source, your copy will execute much quicker.
>
> Becky
>
> On Tue, Jul 19, 2011 at 12:38 PM, Mi Zhou <[email protected]> wrote:
> Hi,
>
> I tried to "pvfs2-cp" a 75G file from NFS to PVFS; it stalled at 1.5G,
> and after a while I got this error:
>
> [E 11:28:15.521435] job_time_mgr_expire: job time out: cancelling flow
> operation, job_id: 2020.
> [E 11:28:15.521578] fp_multiqueue_cancel: flow proto cancel called on
> 0x1468bc78
> [E 11:28:15.521587] fp_multiqueue_cancel: I/O error occurred
> [E 11:28:15.521616] handle_io_error: flow proto error cleanup started on
> 0x1468bc78: Operation cancelled (possibly due to timeout)
> [E 11:28:15.521667] handle_io_error: flow proto 0x1468bc78 canceled 1
> operations, will clean up.
> [E 11:28:15.522059] mem_to_bmi_callback_fn: I/O error occurred
> [E 11:28:15.522124] handle_io_error: flow proto 0x1468bc78 error cleanup
> finished: Operation cancelled (possibly due to timeout)
> pvfs2-cp: src/client/sysint/sys-io.sm:1423:
> io_datafile_complete_operations: Assertion
> `cur_ctx->write_ack.recv_status.actual_size <=
> cur_ctx->write_ack.max_resp_sz' failed.
> Aborted
>
> Any advice is very much appreciated.
>
> Thanks,
>
> --
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> Pvfs2-users mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>
> --
> Becky Ligon
> OrangeFS Support and Development
> Omnibond Systems
> Anderson, South Carolina

--
Mi Zhou
System Integration Engineer
Information Sciences
St. Jude Children's Research Hospital
262 Danny Thomas Pl. MS 312
Memphis, TN 38105
901.595.5771

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
