I'm hoping no more than a few weeks max.

-Randy

On 2/7/13 3:43 PM, "Yves Revaz" <[email protected]> wrote:

>On 02/07/2013 05:29 PM, Randall Martin wrote:
>> Thanks for the patch. I'll merge it in with my patch.
>
>By the way Randy, when do you expect to have patches ready ?
>Is it a matter of days, of months ? Just to have a rough idea,
>
>Thanks in advance,
>
>yves
>
>>
>> -Randy
>>
>> On 2/7/13 11:25 AM, "Michael Robbert" <[email protected]> wrote:
>>
>>> Randy,
>>> As long as you're working on IB patches, I just remembered that I had
>>> to apply a patch before I could get 2.8.7 to build on my CentOS 5
>>> machines running their stock IB stack.
>>>
>>> --- src/io/bmi/bmi_ib/openib.c.orig 2013-01-10 15:47:52.000000000 -0700
>>> +++ src/io/bmi/bmi_ib/openib.c 2013-01-10 15:37:59.000000000 -0700
>>> @@ -745,7 +745,9 @@
>>>  #ifdef HAVE_IBV_EVENT_CLIENT_REREGISTER
>>>      CASE(IBV_EVENT_CLIENT_REREGISTER);
>>>  #endif
>>> +#ifdef HAVE_IBV_EVENT_GID_CHANGE
>>>      CASE(IBV_EVENT_GID_CHANGE);
>>> +#endif
>>>      }
>>>      return s;
>>>  }
>>>
>>> The issue was brought up in a thread on this list last summer, but I
>>> never saw a final resolution and if there was one it apparently didn't
>>> make it into 2.8.7
>>>
>>> Thanks,
>>> Mike Robbert
>>> Colorado School of Mines
>>>
>>> On 2/7/13 8:20 AM, Randall Martin wrote:
>>>> I'm working on a set of patches for the IB support. There are
>>>> several issues I'm working through on the patches before I commit
>>>> them. I'll send you a copy when I have them ready for release so
>>>> you can test them.
>>>>
>>>>
>>>> -Randy
>>>>
>>>>
>>>> On 2/7/13 8:54 AM, "Yves Revaz" <[email protected]> wrote:
>>>>
>>>>> On 10/18/2012 11:41 PM, Kyle Schochenmaier wrote:
>>>>>> Hi Yves -
>>>>>>
>>>>>> How frequently do you see these warnings? Does it cause any
>>>>>> servers/clients to hang?
>>>>> Hi Kyle and the list,
>>>>>
>>>>> In a previous mail, I was mentioning the following errors:
>>>>>
>>>>> [E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: mop_id d0e680 in RTS_DONE message not found.
>>>>> [E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 17549115350.
>>>>> [E 02/07/2013 14:39:54] fp_multiqueue_cancel: flow proto cancel called on 0x1bce5e0
>>>>> [E 02/07/2013 14:39:54] fp_multiqueue_cancel: I/O error occurred
>>>>> [E 02/07/2013 14:39:54] handle_io_error: flow proto error cleanup started on 0x1bce5e0: Operation cancelled (possibly due to timeout)
>>>>> [E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 canceled 1 operations, will clean up.
>>>>> [E 02/07/2013 14:39:54] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 error cleanup finished: Operation cancelled (possibly due to timeout)
>>>>>
>>>>> In fact, I'm trying to move 10 TB of data in our pvfs, using
>>>>> rsync. When a lot of data is transferred, those errors occur
>>>>> very frequently, about every 5 minutes, which is very annoying.
>>>>>
>>>>> I've checked our IB network which is perfectly sane. I'm
>>>>> currently using orangefs-2.8.6/. Should I move to 2.8.7 ? Looking
>>>>> at the changelog of the 2.8.7 release, I don't think IB related
>>>>> problems have been fixed.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> yves
>>>>>
>>>>>> If not common/destructive this could be that there was a simple
>>>>>> error case on the infiniband fabric and that the operation
>>>>>> timed out in pvfs and that can be readily ignored as it would
>>>>>> be retransmitted eventually.
>>>>>>
>>>>>> If you see this a lot it may be one of a few issues that we've
>>>>>> fixed in recent releases, which version of orangefs/pvfs are
>>>>>> you using?
>>>>>>
>>>>>> ~Kyle
>>>>>>
>>>>>> Kyle Schochenmaier
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 18, 2012 at 4:31 PM, Becky Ligon<[email protected]> wrote:
>>>>>>> Yves:
>>>>>>>
>>>>>>> The timeouts that you listed below are in the configuration
>>>>>>> file.
>>>>>>>
>>>>>>> ClientJobBMITimeoutSecs 300 - The client's job scheduler
>>>>>>> limits each "job" sent across the network to this timeout.
>>>>>>> If the job exceeds this limit, the job is cancelled.
>>>>>>> Depending on the request, the job may be retried. Keep in
>>>>>>> mind that one PVFS request can be made up of many jobs.
>>>>>>>
>>>>>>> ClientJobFlowTimeoutSecs - This value limits the time spent
>>>>>>> on a particular job called a flow. A flow is used to
>>>>>>> transfer data across the network to a server or to transfer
>>>>>>> data from a server to the client. Again, if the flow
>>>>>>> exceeds this timeout, then the flow is cancelled.
>>>>>>>
>>>>>>> The server counterparts for these settings are rarely used,
>>>>>>> since the server doesn't normally initiate reads or writes.
>>>>>>>
>>>>>>> I think your real problem has something to do with IB, but I
>>>>>>> am not an expert in that area. I have cc'd Kyle
>>>>>>> Schochenmaier to see if he can help.
>>>>>>>
>>>>>>> Becky
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 18, 2012 at 4:07 PM, Yves Revaz<[email protected]> wrote:
>>>>>>>> Dear list,
>>>>>>>>
>>>>>>>> I sometimes have the following error occurring in my pvfs
>>>>>>>> server log.
>>>>>>>>
>>>>>>>> [E 10/18/2012 20:59:50] Warning: encourage_recv_incoming: mop_id 150c320 in RTS_DONE message not found.
>>>>>>>> [E 10/18/2012 21:00:50] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 33307291.
>>>>>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: flow proto cancel called on 0xf18c80
>>>>>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: I/O error occurred
>>>>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto error cleanup started on 0xf18c80: Operation cancelled (possibly due to timeout)
>>>>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 canceled 1 operations, will clean up.
>>>>>>>> [E 10/18/2012 21:00:50] bmi_recv_callback_fn: I/O error occurred
>>>>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 error cleanup finished: Operation cancelled (possibly due to time
>>>>>>>>
>>>>>>>>
>>>>>>>> Looking at the mailing list, I've found that increasing
>>>>>>>> these default values (300)
>>>>>>>>
>>>>>>>> ServerJobBMITimeoutSecs 30
>>>>>>>> ServerJobFlowTimeoutSecs 30
>>>>>>>> ClientJobBMITimeoutSecs 300
>>>>>>>> ClientJobFlowTimeoutSecs 300
>>>>>>>>
>>>>>>>> to 600.
>>>>>>>>
>>>>>>>> What is at the origin of these timeouts ?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> yves
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>                                              (o o)
>>>>>>>> --------------------------------------------oOO--(_)--OOo-------
>>>>>>>>   Dr. Yves Revaz
>>>>>>>>   Laboratory of Astrophysics EPFL
>>>>>>>>   Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>>>>>>>>   51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>>>>>>>>   1290 Sauverny             e-mail : [email protected]
>>>>>>>>   SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>>>>>>>> ----------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Becky Ligon
>>>>>>> OrangeFS Support and Development
>>>>>>> Omnibond Systems
>>>>>>> Anderson, South Carolina
>>>>>>>
>>>>>
>>>>> --
>>>>> ----------------------------------------------------------------
>>>>>   Dr. Yves Revaz
>>>>>   Laboratory of Astrophysics
>>>>>   Ecole Polytechnique Fédérale de Lausanne (EPFL)
>>>>>   Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>>>>>   51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>>>>>   1290 Sauverny             e-mail : [email protected]
>>>>>   SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>>>>> ----------------------------------------------------------------

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
