I'm hoping no more than a few weeks.

-Randy

On 2/7/13 3:43 PM, "Yves Revaz" <[email protected]> wrote:

>On 02/07/2013 05:29 PM, Randall Martin wrote:
>> Thanks for the patch.  I'll merge it in with my patch.
>
>By the way Randy, when do you expect to have the patches ready?
>Is it a matter of days or months? Just to have a rough idea.
>
>Thanks in advance,
>
>yves
>
>>
>> -Randy
>>
>> On 2/7/13 11:25 AM, "Michael Robbert" <[email protected]> wrote:
>>
>>> Randy,
>>> While you're working on IB patches, I just remembered that I had to
>>> apply a patch before I could get 2.8.7 to build on my CentOS 5
>>> machines running their stock IB stack.
>>>
>>> --- src/io/bmi/bmi_ib/openib.c.orig     2013-01-10 15:47:52.000000000 -0700
>>> +++ src/io/bmi/bmi_ib/openib.c  2013-01-10 15:37:59.000000000 -0700
>>> @@ -745,7 +745,9 @@
>>> #ifdef HAVE_IBV_EVENT_CLIENT_REREGISTER
>>>         CASE(IBV_EVENT_CLIENT_REREGISTER);
>>> #endif
>>> +#ifdef HAVE_IBV_EVENT_GID_CHANGE
>>>         CASE(IBV_EVENT_GID_CHANGE);
>>> +#endif
>>>      }
>>>      return s;
>>> }
>>>
>>> The issue was brought up in a thread on this list last summer, but I
>>> never saw a final resolution, and if there was one, it apparently
>>> didn't make it into 2.8.7.
>>>
>>> Thanks,
>>> Mike Robbert
>>> Colorado School of Mines
>>>
>>> On 2/7/13 8:20 AM, Randall Martin wrote:
>>>> I'm working on a set of patches for the IB support.  There are
>>>> several issues I'm working through on the patches before I commit
>>>> them.  I'll send you a copy when I have them ready for release so
>>>> you can test them.
>>>>
>>>>
>>>> -Randy
>>>>
>>>>
>>>> On 2/7/13 8:54 AM, "Yves Revaz" <[email protected]> wrote:
>>>>
>>>>> On 10/18/2012 11:41 PM, Kyle Schochenmaier wrote:
>>>>>> Hi Yves -
>>>>>>
>>>>>> How frequently do you see these warnings?  Do they cause any
>>>>>> servers/clients to hang?
>>>>> Hi Kyle and the list,
>>>>>
>>>>> In a previous mail, I mentioned the following errors:
>>>>>
>>>>> [E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: mop_id d0e680 in RTS_DONE message not found.
>>>>> [E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 17549115350.
>>>>> [E 02/07/2013 14:39:54] fp_multiqueue_cancel: flow proto cancel called on 0x1bce5e0
>>>>> [E 02/07/2013 14:39:54] fp_multiqueue_cancel: I/O error occurred
>>>>> [E 02/07/2013 14:39:54] handle_io_error: flow proto error cleanup started on 0x1bce5e0: Operation cancelled (possibly due to timeout)
>>>>> [E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 canceled 1 operations, will clean up.
>>>>> [E 02/07/2013 14:39:54] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 error cleanup finished: Operation cancelled (possibly due to timeout)
>>>>>
>>>>> In fact, I'm trying to move 10 TB of data into our PVFS file
>>>>> system using rsync. When a lot of data is transferred, these
>>>>> errors occur very frequently, about every 5 minutes, which is
>>>>> very annoying.
>>>>>
>>>>> I've checked our IB network, which is perfectly sane. I'm
>>>>> currently using orangefs-2.8.6. Should I move to 2.8.7? Looking
>>>>> at the changelog of the 2.8.7 release, I don't think the
>>>>> IB-related problems have been fixed.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> yves
>>>>>
>>>>>> If it's not common or destructive, this could just mean there was
>>>>>> a simple error case on the InfiniBand fabric and the operation
>>>>>> timed out in PVFS; that can be readily ignored, as it would be
>>>>>> retransmitted eventually.
>>>>>>
>>>>>> If you see this a lot, it may be one of a few issues that we've
>>>>>> fixed in recent releases. Which version of orangefs/pvfs are you
>>>>>> using?
>>>>>>
>>>>>> ~Kyle
>>>>>>
>>>>>> Kyle Schochenmaier
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 18, 2012 at 4:31 PM, Becky
>>>>>> Ligon<[email protected]>  wrote:
>>>>>>> Yves:
>>>>>>>
>>>>>>> The timeouts that you listed below are in the configuration
>>>>>>> file.
>>>>>>>
>>>>>>> ClientJobBMITimeoutSecs 300 - The client's job scheduler
>>>>>>> limits each "job" sent across the network to this timeout.
>>>>>>> If the job exceeds this limit, the job is cancelled.
>>>>>>> Depending on the request, the job may be retried. Keep in
>>>>>>> mind that one PVFS request can be made up of many jobs.
>>>>>>>
>>>>>>> ClientJobFlowTimeoutSecs - This value limits the time spent
>>>>>>> on a particular job called a flow.  A flow is used to
>>>>>>> transfer data across the network to a server or to transfer
>>>>>>> data from a server to the client.    Again, if the flow
>>>>>>> exceeds this timeout, then the flow is cancelled.
>>>>>>>
>>>>>>> The server counterparts for these settings are rarely used,
>>>>>>> since the server doesn't normally initiate reads or writes.
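>>>>>>>
>>>>>>> As a rough sketch only, and assuming the usual layout where these
>>>>>>> directives sit in the Defaults section of the fs.conf shared by
>>>>>>> the servers and clients (double-check against your own config),
>>>>>>> raising the timeouts to the 600 seconds you mention below would
>>>>>>> look something like:
>>>>>>>
>>>>>>> <Defaults>
>>>>>>>     # example values only, mirroring the 600s suggested on the
>>>>>>>     # list; tune them to your workload
>>>>>>>     ServerJobBMITimeoutSecs 600
>>>>>>>     ServerJobFlowTimeoutSecs 600
>>>>>>>     ClientJobBMITimeoutSecs 600
>>>>>>>     ClientJobFlowTimeoutSecs 600
>>>>>>>     # ... keep the rest of your Defaults entries unchanged
>>>>>>> </Defaults>
>>>>>>>
>>>>>>> The servers would typically need to be restarted to pick up the
>>>>>>> new values.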
>>>>>>>
>>>>>>> I think your real problem has something to do with IB, but I
>>>>>>> am not an expert in that area.  I have cc'd Kyle
>>>>>>> Schochenmaier to see if he can help.
>>>>>>>
>>>>>>> Becky
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 18, 2012 at 4:07 PM, Yves
>>>>>>> Revaz<[email protected]>  wrote:
>>>>>>>> Dear list,
>>>>>>>>
>>>>>>>> I sometimes have the following errors occurring in my PVFS
>>>>>>>> server log.
>>>>>>>>
>>>>>>>> [E 10/18/2012 20:59:50] Warning: encourage_recv_incoming: mop_id 150c320 in RTS_DONE message not found.
>>>>>>>> [E 10/18/2012 21:00:50] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 33307291.
>>>>>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: flow proto cancel called on 0xf18c80
>>>>>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: I/O error occurred
>>>>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto error cleanup started on 0xf18c80: Operation cancelled (possibly due to timeout)
>>>>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 canceled 1 operations, will clean up.
>>>>>>>> [E 10/18/2012 21:00:50] bmi_recv_callback_fn: I/O error occurred
>>>>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 error cleanup finished: Operation cancelled (possibly due to timeout)
>>>>>>>>
>>>>>>>>
>>>>>>>> Looking at the mailing list, I've found the suggestion to
>>>>>>>> increase these default values
>>>>>>>>
>>>>>>>> ServerJobBMITimeoutSecs 30
>>>>>>>> ServerJobFlowTimeoutSecs 30
>>>>>>>> ClientJobBMITimeoutSecs 300
>>>>>>>> ClientJobFlowTimeoutSecs 300
>>>>>>>>
>>>>>>>> to 600.
>>>>>>>>
>>>>>>>> What is the origin of these timeouts?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> yves
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Becky Ligon
>>>>>>> OrangeFS Support and Development
>>>>>>> Omnibond Systems
>>>>>>> Anderson, South Carolina
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>
>
>
>-- 
>                                                  (o o)
>--------------------------------------------oOO--(_)--OOo-------
>   Dr. Yves Revaz
>   Laboratory of Astrophysics EPFL
>   Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>   51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>   1290 Sauverny             e-mail : [email protected]
>   SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>----------------------------------------------------------------
>



_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
