I'm working on a set of patches for the IB support. There are several issues I'm working through on the patches before I commit them. I'll send you a copy when I have them ready for release so you can test them.
-Randy

On 2/7/13 8:54 AM, "Yves Revaz" <[email protected]> wrote:

>On 10/18/2012 11:41 PM, Kyle Schochenmaier wrote:
>> Hi Yves -
>>
>> How frequently do you see these warnings? Does it cause any
>> servers/clients to hang?
>
>Hi Kyle and the list,
>
>In a previous mail, I mentioned the following errors:
>
>[E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: mop_id d0e680 in RTS_DONE message not found.
>[E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 17549115350.
>[E 02/07/2013 14:39:54] fp_multiqueue_cancel: flow proto cancel called on 0x1bce5e0
>[E 02/07/2013 14:39:54] fp_multiqueue_cancel: I/O error occurred
>[E 02/07/2013 14:39:54] handle_io_error: flow proto error cleanup started on 0x1bce5e0: Operation cancelled (possibly due to timeout)
>[E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 canceled 1 operations, will clean up.
>[E 02/07/2013 14:39:54] bmi_recv_callback_fn: I/O error occurred
>[E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 error cleanup finished: Operation cancelled (possibly due to timeout)
>
>In fact, I'm trying to move 10 TB of data in our pvfs using rsync.
>When a lot of data is transferred, these errors occur very frequently,
>about every 5 minutes, which is very annoying.
>
>I've checked our IB network, which is perfectly sane.
>I'm currently using orangefs-2.8.6. Should I move to 2.8.7?
>Looking at the changelog of the 2.8.7 release, I don't think any
>IB-related problems have been fixed.
>
>Thanks,
>
>yves
>
>> If not common/destructive, this could be a simple error case on the
>> InfiniBand fabric where the operation timed out in pvfs; that can be
>> readily ignored, as it would be retransmitted eventually.
>>
>> If you see this a lot, it may be one of a few issues that we've fixed
>> in recent releases. Which version of orangefs/pvfs are you using?
>> ~Kyle
>>
>> Kyle Schochenmaier
>>
>> On Thu, Oct 18, 2012 at 4:31 PM, Becky Ligon <[email protected]> wrote:
>>> Yves:
>>>
>>> The timeouts that you listed below are in the configuration file.
>>>
>>> ClientJobBMITimeoutSecs 300 - The client's job scheduler limits each
>>> "job" sent across the network to this timeout. If the job exceeds this
>>> limit, the job is cancelled. Depending on the request, the job may be
>>> retried. Keep in mind that one PVFS request can be made up of many jobs.
>>>
>>> ClientJobFlowTimeoutSecs - This value limits the time spent on a
>>> particular job called a flow. A flow is used to transfer data across
>>> the network to a server or to transfer data from a server to the
>>> client. Again, if the flow exceeds this timeout, then the flow is
>>> cancelled.
>>>
>>> The server counterparts for these settings are rarely used, since the
>>> server doesn't normally initiate reads or writes.
>>>
>>> I think your real problem has something to do with IB, but I am not an
>>> expert in that area. I have cc'd Kyle Schochenmaier to see if he can
>>> help.
>>>
>>> Becky
>>>
>>> On Thu, Oct 18, 2012 at 4:07 PM, Yves Revaz <[email protected]> wrote:
>>>>
>>>> Dear list,
>>>>
>>>> I sometimes have the following error occurring in my pvfs server log.
>>>>
>>>> [E 10/18/2012 20:59:50] Warning: encourage_recv_incoming: mop_id 150c320 in RTS_DONE message not found.
>>>> [E 10/18/2012 21:00:50] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 33307291.
>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: flow proto cancel called on 0xf18c80
>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: I/O error occurred
>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto error cleanup started on 0xf18c80: Operation cancelled (possibly due to timeout)
>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 canceled 1 operations, will clean up.
>>>> [E 10/18/2012 21:00:50] bmi_recv_callback_fn: I/O error occurred
>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 error cleanup finished: Operation cancelled (possibly due to time
>>>>
>>>> Looking at the mailing list, I've found a suggestion to increase these
>>>> default values (300)
>>>>
>>>> ServerJobBMITimeoutSecs 30
>>>> ServerJobFlowTimeoutSecs 30
>>>> ClientJobBMITimeoutSecs 300
>>>> ClientJobFlowTimeoutSecs 300
>>>>
>>>> to 600.
>>>>
>>>> What is the origin of these timeouts?
>>>>
>>>> Thanks,
>>>>
>>>> yves
>>>>
>>>> --
>>>>                                                  (o o)
>>>> --------------------------------------------oOO--(_)--OOo-------
>>>>   Dr. Yves Revaz
>>>>   Laboratory of Astrophysics EPFL
>>>>
>>>>   Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>>>>   51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>>>>   1290 Sauverny             e-mail : [email protected]
>>>>   SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>>>> ----------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> [email protected]
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>
>>> --
>>> Becky Ligon
>>> OrangeFS Support and Development
>>> Omnibond Systems
>>> Anderson, South Carolina
>
>--
>
>----------------------------------------------------------------
>  Dr. Yves Revaz
>  Laboratory of Astrophysics
>  Ecole Polytechnique Fédérale de Lausanne (EPFL)
>  Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>  51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>  1290 Sauverny             e-mail : [email protected]
>  SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>----------------------------------------------------------------
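
To put a number on "how frequently do you see these warnings", something like the following could be run over a server log. This is only a minimal sketch: it assumes each entry keeps the `[E MM/DD/YYYY HH:MM:SS]` prefix seen in the excerpts in this thread and that every flow timeout is reported by a `job_time_mgr_expire: job time out` line. The names `timeout_times` and `gaps_in_seconds` are hypothetical helpers, not part of OrangeFS/PVFS.

```python
import re
from datetime import datetime

# Assumed entry format, taken from the excerpts in this thread, e.g.
# "[E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: ..."
TIMEOUT_RE = re.compile(
    r"\[E (\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\] "
    r"job_time_mgr_expire: job time out"
)


def timeout_times(lines):
    """Return the timestamp of every job_time_mgr_expire entry in lines."""
    times = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # Timestamps in the log are month/day/year.
            times.append(datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the seconds elapsed between consecutive timeout entries."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Passing an open log file to `timeout_times` and the result to `gaps_in_seconds` gives the spacing between timeouts, which can then be compared with the configured `ClientJobFlowTimeoutSecs` / `ServerJobFlowTimeoutSecs` values.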
