Let me take a look at the BMI-ib module and see if I can figure out why you
are getting these error messages.

Another question:  when pvfs2-cp failed and you got the timing messages on
the client, did you also get timing messages on any of the servers at about
the same time?
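
If it helps, here is a rough Python sketch of one way to check that. It assumes the two timestamp formats seen in the excerpts in this thread (time only on the client, date plus time on the server) and assumes the servers would log the same job_time_mgr_expire text as the client, which I have not confirmed. The function names and the 60-second window are only illustrative.

```python
import re
from datetime import timedelta

# Assumed formats, taken from the excerpts in this thread:
#   client:  [E 11:28:15.521435] job_time_mgr_expire: ...
#   server:  [E 07/19/2011 15:31:54] ...
STAMP = re.compile(r"^\[[A-Z] (?:\d{2}/\d{2}/\d{4} )?(\d{2}):(\d{2}):(\d{2})")

def time_of_day(line):
    """Return the time of day of a log line as a timedelta, or None."""
    m = STAMP.match(line.strip())
    if not m:
        return None
    hours, minutes, seconds = (int(g) for g in m.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def timeouts(lines):
    """Yield (time_of_day, line) for every job timeout message."""
    for line in lines:
        if "job_time_mgr_expire" in line:
            t = time_of_day(line)
            if t is not None:
                yield t, line.strip()

def near(client_lines, server_lines, window_secs=60):
    """Return server timeout lines within window_secs of any client timeout.

    Only the time of day is compared, so a run that crosses midnight
    is not handled.
    """
    window = timedelta(seconds=window_secs)
    client_times = [t for t, _ in timeouts(client_lines)]
    return [line for t, line in timeouts(server_lines)
            if any(abs(t - c) <= window for c in client_times)]
```

You would pass it the lines of the client output and of each server log in turn, for example near(open(client_log), open(server_log)), where both paths are whatever files you saved the output to.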

Becky

On Tue, Jul 19, 2011 at 4:40 PM, Mi Zhou <[email protected]> wrote:

> Hi Becky,
>
> Now it does not time out, but one of the pvfs2 server nodes crashed in
> the middle of the copy:
>
> [D 07/19/2011 15:31:54] ib_check_cq: recv from 172.20.101.34:45263 len 16 type MSG_RTS_DONE credit 1.
> [D 07/19/2011 15:31:54] encourage_recv_incoming: recv RTS_DONE mop_id 11957200.
> [E 07/19/2011 15:31:54] Error: encourage_recv_incoming: mop_id 11957200 in RTS_DONE message not found.
> [E 07/19/2011 15:31:54]         [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(error+0xca) [0x449d6a]
> [E 07/19/2011 15:31:54]         [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x446f60]
> [E 07/19/2011 15:31:54]         [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x448aa9]
> [E 07/19/2011 15:31:54]         [bt] /opt/pvfs/pvfs/sbin/pvfs2-server(BMI_testunexpected+0x383) [0x445293]
> [E 07/19/2011 15:31:54]         [bt] /opt/pvfs/pvfs/sbin/pvfs2-server [0x47b77c]
> [E 07/19/2011 15:31:54]         [bt] /lib64/libpthread.so.0 [0x3ba820673d]
> [E 07/19/2011 15:31:54]         [bt] /lib64/libc.so.6(clone+0x6d) [0x3ba7ad44bd]
>
>
> Thanks,
>
> Mi
>
> On Tue, 2011-07-19 at 12:47 -0500, Becky Ligon wrote:
> > Mi:
> >
> > In your configuration file set the following:
> >
> > <Defaults>
> >     ServerJobFlowTimeoutSecs  600
> >     ClientJobFlowTimeoutSecs  600
> >
> >     ServerJobBMITimeoutSecs   600
> >     ClientJobBMITimeoutSecs   600
> > </Defaults>
> >
> > Normally, these timeouts are 300 seconds (5 minutes).  See if this
> > helps with the NFS-to-PVFS 75GB copy.
> >
> > I will also check into why pvfs2-cp issued that assert.  Most likely,
> > the code isn't handling error conditions properly.
> >
> > Becky
> >
> >
> >
> > On Tue, Jul 19, 2011 at 1:33 PM, Becky Ligon <[email protected]>
> > wrote:
> >         Mi:
> >
> >         I believe you need to increase the job timer configuration
> >         option.  Give me a few minutes and I'll send you the exact
> >         information.
> >
> >         If you can avoid using NFS and copy directly from the physical
> >         source, your copy will run much more quickly.
> >
> >         Becky
> >
> >
> >
> >         On Tue, Jul 19, 2011 at 12:38 PM, Mi Zhou <[email protected]>
> >         wrote:
> >                 Hi,
> >
> >                 I tried to "pvfs2-cp" a 75G file from NFS to PVFS. It
> >                 stalled at 1.5G, and after a while I got this error:
> >
> >                 [E 11:28:15.521435] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 2020.
> >                 [E 11:28:15.521578] fp_multiqueue_cancel: flow proto cancel called on 0x1468bc78
> >                 [E 11:28:15.521587] fp_multiqueue_cancel: I/O error occurred
> >                 [E 11:28:15.521616] handle_io_error: flow proto error cleanup started on 0x1468bc78: Operation cancelled (possibly due to timeout)
> >                 [E 11:28:15.521667] handle_io_error: flow proto 0x1468bc78 canceled 1 operations, will clean up.
> >                 [E 11:28:15.522059] mem_to_bmi_callback_fn: I/O error occurred
> >                 [E 11:28:15.522124] handle_io_error: flow proto 0x1468bc78 error cleanup finished: Operation cancelled (possibly due to timeout)
> >                 pvfs2-cp: src/client/sysint/sys-io.sm:1423: io_datafile_complete_operations: Assertion `cur_ctx->write_ack.recv_status.actual_size <= cur_ctx->write_ack.max_resp_sz' failed.
> >                 Aborted
> >
> >
> >                 Any advice is very much appreciated.
> >
> >                 Thanks,
> >
> >
> >                 --
> >
> >                 Mi Zhou
> >                 System Integration Engineer
> >                 Information Sciences
> >                 St. Jude Children's Research Hospital
> >                 262 Danny Thomas Pl. MS 312
> >                 Memphis, TN 38105
> >                 901.595.5771
> >
> >
> >                 Email Disclaimer:  www.stjude.org/emaildisclaimer
> >
> >
> >                 _______________________________________________
> >                 Pvfs2-users mailing list
> >                 [email protected]
> >                 http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> >
> >
> >
> >
> >         --
> >         Becky Ligon
> >         OrangeFS Support and Development
> >         Omnibond Systems
> >         Anderson, South Carolina
> >
> >
> >
> >
> >
> > --
> > Becky Ligon
> > OrangeFS Support and Development
> > Omnibond Systems
> > Anderson, South Carolina
> >
> >
> --
>
> Mi Zhou
> System Integration Engineer
> Information Sciences
> St. Jude Children's Research Hospital
> 262 Danny Thomas Pl. MS 312
> Memphis, TN 38105
> 901.595.5771
>
>
>


-- 
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
